doc: Add concept page on submitting I/O to an NVMe device
To be referenced in an upcoming blog post.

Change-Id: Ib1669b3223257d1c1978e8664edf9b1e078918a3
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/c/spdk/spdk/+/453453
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>

parent 30af9d3798
commit c2c0ece7a2

@@ -823,6 +823,7 @@ INPUT += \
 notify.md \
 nvme.md \
 nvme-cli.md \
+nvme_spec.md \
 nvmf.md \
 nvmf_tgt_pg.md \
 nvmf_tracing.md \

@@ -4,6 +4,7 @@
 - @subpage memory
 - @subpage concurrency
 - @subpage ssd_internals
+- @subpage nvme_spec
 - @subpage vhost_processing
 - @subpage overview
 - @subpage porting

doc/nvme_spec.md (new file, 123 lines)
@@ -0,0 +1,123 @@

# Submitting I/O to an NVMe Device {#nvme_spec}

## The NVMe Specification

The NVMe specification describes a hardware interface for interacting with
storage devices. The specification includes network transport definitions for
remote storage as well as a hardware register layout for local PCIe devices.
What follows here is an overview of how an I/O is submitted to a local PCIe
device through SPDK.

NVMe devices allow host software (in our case, the SPDK NVMe driver) to
allocate queue pairs in host memory. The term "host" is used a lot, so to
clarify, that's the system that the NVMe SSD is plugged into. A queue pair
consists of two queues - a submission queue and a completion queue. These
queues are more accurately described as circular rings of fixed-size entries.
The submission queue is an array of 64 byte command structures, plus 2 integers
(head and tail indices). The completion queue is similarly an array of 16 byte
completion structures, plus 2 integers (head and tail indices). There are also
two 32-bit registers involved that are called doorbells.

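As a rough mental model, that layout can be sketched in C as below. This is an
illustrative sketch only, not SPDK's actual definitions (the real entry layouts
are `struct spdk_nvme_cmd` and `struct spdk_nvme_cpl` in
`include/spdk/nvme_spec.h`); the names here are made up for clarity.

```c
#include <stdint.h>

/* Illustrative sketch only - not SPDK's actual structures. Entry sizes match
 * the specification. */
struct sq_entry { uint8_t raw[64]; };  /* one 64 byte submission queue command */
struct cq_entry { uint8_t raw[16]; };  /* one 16 byte completion queue entry   */

struct queue_pair_sketch {
	struct sq_entry *sq;      /* circular ring of submission entries        */
	struct cq_entry *cq;      /* circular ring of completion entries        */
	uint16_t sq_head;         /* consumer index, advanced by the device     */
	uint16_t sq_tail;         /* producer index, advanced by the host       */
	uint16_t cq_head;         /* consumer index, advanced by the host       */
	uint8_t  phase;           /* phase bit expected in the next completion  */
	volatile uint32_t *sq_tail_doorbell; /* 32-bit device registers (MMIO)  */
	volatile uint32_t *cq_head_doorbell;
};
```
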
An I/O is submitted to an NVMe device by constructing a 64 byte command,
placing it into the submission queue at the current location of the submission
queue tail index, and then writing the new index of the submission queue tail
to the submission queue tail doorbell register. It's actually valid to copy a
whole set of commands into open slots in the ring and then write the doorbell
just one time to submit the whole batch.

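Continuing the illustrative sketch above, the submission step looks roughly
like this. The helper name is hypothetical, not SPDK's internal code, and a
real implementation also needs a write memory barrier before the doorbell
write on weakly ordered CPUs, plus checks that the ring isn't full.

```c
/* Hypothetical helper, continuing the queue_pair_sketch above. */
static void
submit_commands(struct queue_pair_sketch *qp, const struct sq_entry *cmds,
		uint16_t count, uint16_t sq_size)
{
	for (uint16_t i = 0; i < count; i++) {
		/* Copy each 64 byte command into the next open ring slot. */
		qp->sq[qp->sq_tail] = cmds[i];
		qp->sq_tail = (qp->sq_tail + 1) % sq_size;
	}

	/* A single MMIO doorbell write submits the whole batch by telling the
	 * device how far the tail has advanced. */
	*qp->sq_tail_doorbell = qp->sq_tail;
}
```
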
There is a very detailed description of the command submission and completion
process in the NVMe specification, which is conveniently available from the
main page over at [NVM Express](https://nvmexpress.org).

Most importantly, the command itself describes the operation and also, if
necessary, a location in host memory containing a descriptor for host memory
associated with the command. This host memory is the data to be written on a
write command, or the location to place the data on a read command. Data is
transferred to or from this location using a DMA engine on the NVMe device.

The completion queue works similarly, but the device is instead the one writing
entries into the ring. Each entry contains a "phase" bit that toggles between 0
and 1 on each loop through the entire ring. When a queue pair is set up to
generate interrupts, the interrupt contains the index of the completion queue
head. However, SPDK doesn't enable interrupts and instead polls on the phase
bit to detect completions. Interrupts are very heavy operations, so polling
this phase bit is often far more efficient.

## The SPDK NVMe Driver I/O Path

Now that we know how the ring structures work, let's cover how the SPDK NVMe
driver uses them. The user is going to construct a queue pair at some early
time in the life cycle of the program, so that's not part of the "hot" path.
Then, they'll call functions like `spdk_nvme_ns_cmd_read()` to perform an I/O
operation. The user supplies a data buffer, the target LBA, and the length, as
well as other information like which NVMe namespace the command is targeted at
and which NVMe queue pair to use. Finally, the user provides a callback
function and context pointer that will be called when a completion for the
resulting command is discovered during a later call to
`spdk_nvme_qpair_process_completions()`.

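For concreteness, a minimal read through this path might look like the sketch
below. It assumes the controller has already been attached and that `ns` and
`qpair` were obtained during startup (for example via `spdk_nvme_probe()` and
`spdk_nvme_ctrlr_alloc_io_qpair()`); error handling is kept to a minimum.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

struct read_ctx {
	bool done;
};

/* Completion callback supplied to spdk_nvme_ns_cmd_read(). It runs from
 * within spdk_nvme_qpair_process_completions() on the polling thread. */
static void
read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
	struct read_ctx *ctx = arg;

	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	ctx->done = true;
}

static int
read_one_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair, uint64_t lba)
{
	struct read_ctx ctx = { .done = false };

	/* The payload must live in DMA-able memory, so allocate it with SPDK's
	 * allocator rather than malloc(). */
	void *buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL,
				 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	if (buf == NULL) {
		return -ENOMEM;
	}

	int rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, 1 /* block count */,
				       read_complete, &ctx, 0 /* io_flags */);
	if (rc != 0) {
		spdk_free(buf);
		return rc;
	}

	/* Poll until the completion for this command is found; 0 means process
	 * as many completions as are available. */
	while (!ctx.done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_free(buf);
	return 0;
}
```
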
The first stage in the driver is allocating a request object to track the
operation. The operations are asynchronous, so it can't simply track the state
of the request on the call stack. Allocating a new request object on the heap
would be far too slow, so SPDK keeps a pre-allocated set of request objects
inside of the NVMe queue pair object - `struct spdk_nvme_qpair`. The number of
requests allocated to the queue pair is larger than the actual queue depth of
the NVMe submission queue because SPDK supports a couple of key convenience
features. The first is software queueing - SPDK will allow the user to submit
more requests than the hardware queue can actually hold and SPDK will
automatically queue in software. The second is splitting. SPDK will split a
request for many reasons, some of which are outlined next. The number of
request objects is configurable at queue pair creation time and if not
specified, SPDK will pick a sensible number based on the hardware queue depth.

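As a sketch of that configuration point, both the hardware queue depth and the
request pool size can be set when the queue pair is created through SPDK's I/O
queue pair options. The values below are arbitrary examples, and this is an
outline rather than a complete program.

```c
#include "spdk/nvme.h"

/* Sketch: create an I/O queue pair with an explicit hardware queue depth and
 * request pool size. */
static struct spdk_nvme_qpair *
alloc_io_qpair(struct spdk_nvme_ctrlr *ctrlr)
{
	struct spdk_nvme_io_qpair_opts opts;

	spdk_nvme_ctrlr_get_default_io_qpair_opts(ctrlr, &opts, sizeof(opts));
	opts.io_queue_size = 256;       /* entries in the hardware submission queue     */
	opts.io_queue_requests = 1024;  /* request objects for software queueing/splits */

	return spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, &opts, sizeof(opts));
}
```
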
The second stage is building the 64 byte NVMe command itself. The command is
built into memory embedded into the request object - not directly into an NVMe
submission queue slot. Once the command has been constructed, SPDK attempts to
obtain an open slot in the NVMe submission queue. For each element in the
submission queue an object called a tracker is allocated. The trackers are
allocated in an array, so they can be quickly looked up by an index. The
tracker itself contains a pointer to the request currently occupying that slot.
When a particular tracker is obtained, the command's CID value is updated with
the index of the tracker. The NVMe specification requires the device to return
that CID value in the completion, so the request can be recovered by looking up
the tracker via the CID value and then following the pointer.

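The mapping from a completion back to its request is therefore just an array
index. A simplified, hypothetical illustration follows; SPDK's real tracker
lives in the PCIe transport code and carries considerably more state.

```c
#include <stdint.h>

/* Hypothetical, simplified types for illustration only. */
struct request_sketch {
	void (*cb_fn)(void *cb_arg);  /* user callback captured at submit time */
	void *cb_arg;
};

struct tracker_sketch {
	struct request_sketch *req;   /* request currently occupying this SQ slot */
};

/* On completion, the CID reported by the device indexes straight into the
 * tracker array, which points back at the request to complete. */
static struct request_sketch *
request_from_cid(struct tracker_sketch *trackers, uint16_t cid)
{
	return trackers[cid].req;
}
```
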
Once a tracker (slot) is obtained, the data buffer associated with it is
processed to build a PRP list. That's essentially an NVMe scatter gather list,
although it is a bit more restricted. The user provides SPDK with the virtual
address of the buffer, so SPDK has to perform a page table lookup to find the
physical address (pa) or I/O virtual address (iova) backing that virtual
memory. A virtually contiguous memory region may not be physically contiguous,
so this may result in a PRP list with multiple elements. Sometimes this may
result in a set of physical addresses that can't actually be expressed as a
single PRP list, so SPDK will automatically split the user operation into two
separate requests transparently. For more information on how memory is
managed, see @ref memory.

The reason the PRP list is not built until a tracker is obtained is that the
PRP list description must be allocated in DMA-able memory and can be quite
large. Since SPDK typically allocates a large number of requests, we didn't
want to allocate enough space to pre-build the worst case PRP list for each
one, especially given that the common case does not require a separate PRP
list at all.

Each NVMe command has two PRP list elements embedded into it, so a separate
PRP list isn't required if the request is 4KiB (or if it is 8KiB and aligned
perfectly). Profiling shows that this section of the code is not a major
contributor to the overall CPU use.

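A rough sketch of that decision, assuming the transfer has already been
resolved into 4KiB physical pages, is shown below. This is an illustrative
simplification, not SPDK's code.

```c
#include <stdint.h>

/* Which PRP form a command needs, given how many 4KiB pages back the buffer. */
enum prp_layout {
	PRP_INLINE_ONE,     /* PRP1 alone covers the transfer                 */
	PRP_INLINE_TWO,     /* PRP1 and PRP2 each point directly at a page    */
	PRP_SEPARATE_LIST   /* PRP2 points at a PRP list in DMA-able memory   */
};

static enum prp_layout
choose_prp_layout(uint32_t page_count)
{
	if (page_count <= 1) {
		return PRP_INLINE_ONE;
	} else if (page_count == 2) {
		return PRP_INLINE_TWO;
	}
	return PRP_SEPARATE_LIST;
}
```
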
With a tracker filled out, SPDK copies the 64 byte command into the actual
NVMe submission queue slot and then rings the submission queue tail doorbell
to tell the device to go process it. SPDK then returns to the user, without
waiting for a completion.

The user can periodically call `spdk_nvme_qpair_process_completions()` to tell
SPDK to examine the completion queue. Specifically, it reads the phase bit of
the next expected completion slot and when it flips, looks at the CID value to
find the tracker, which points at the request object. The request object
contains a function pointer that the user provided initially, which is then
called to complete the command.

The `spdk_nvme_qpair_process_completions()` function will keep advancing to
the next completion slot until it runs out of completions, at which point it
will write the completion queue head doorbell to let the device know that it
can reuse those completion queue slots for new completions, and then return.

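The loop described in the last two paragraphs behaves roughly like the sketch
below. The types and names are hypothetical; the real completion entry is the
16 byte `struct spdk_nvme_cpl`, whose status halfword carries the phase bit in
bit 0.

```c
#include <stdint.h>

/* Hypothetical, simplified completion entry. */
struct cq_entry_sketch {
	uint16_t cid;
	uint16_t status;   /* bit 0 is the phase bit in this simplified layout */
};

/* Drain the completion queue, then write the head doorbell once. `phase`
 * starts at 1 after queue creation and flips on every wrap of the ring. */
static uint32_t
poll_completions(struct cq_entry_sketch *cq, uint32_t cq_size, uint32_t *cq_head,
		 uint8_t *phase, volatile uint32_t *cq_head_doorbell)
{
	uint32_t count = 0;

	while ((cq[*cq_head].status & 0x1) == *phase) {
		/* ...use cq[*cq_head].cid to find the tracker, follow it to the
		 * request, and invoke the user's callback here... */
		count++;
		if (++(*cq_head) == cq_size) {
			*cq_head = 0;
			*phase ^= 1;
		}
	}

	if (count > 0) {
		*cq_head_doorbell = *cq_head;
	}

	return count;
}
```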