Add documentation explaining memory management
Change-Id: Ifce9507fc327ee090d4a825323df928d440fe025
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/362273
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
parent da0dfddb35
commit 4ee35e1f35

@@ -783,6 +783,7 @@ WARN_LOGFILE =
 INPUT = ../include/spdk \
         index.md \
         directory_structure.md \
+        memory.md \
         porting.md \
         blob.md \
         blobfs.md \

doc/index.md (12 lines changed)
@@ -5,15 +5,17 @@
 - [SPDK on GitHub](https://github.com/spdk/spdk/)
 - [SPDK.io](http://www.spdk.io/)

-The Storage Performance Development Kit (SPDK) provides a set of tools and libraries
-for writing high performance, scalable, user-mode storage applications.
-It achieves high performance by moving all of the necessary drivers
-into userspace and operating in a polled mode instead of relying on interrupts,
-which avoids kernel context switches and eliminates interrupt handling overhead.
+The Storage Performance Development Kit (SPDK) provides a set of tools and
+libraries for writing high performance, scalable, user-mode storage
+applications. It achieves high performance by moving all of the necessary
+drivers into userspace and operating in a polled mode instead of relying on
+interrupts, which avoids kernel context switches and eliminates interrupt
+handling overhead.

 ## General Information {#general}

 - @ref directory_structure
+- @ref memory
 - @ref porting
 - [Public API header files](files.html)

doc/memory.md (new file, 118 lines)

# Memory Management for User Space Drivers {#memory}

The following explains why all data buffers passed to SPDK must be allocated
using spdk_dma_malloc() or its siblings, and why SPDK relies on DPDK's proven
base functionality to implement memory management.

Computing platforms generally carve physical memory up into 4KiB segments
called pages. They number the pages from 0 to N starting from the beginning of
addressable memory. Operating systems then overlay 4KiB virtual memory pages on
top of these physical pages using arbitrarily complex mappings. See
[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview.

Physical memory is attached on channels, where each memory channel provides
some fixed amount of bandwidth. To optimize total memory bandwidth, the
physical addressing is often set up to automatically interleave between
channels. For instance, page 0 may be located on channel 0, page 1 on channel
1, page 2 on channel 2, etc. This is so that writing to memory sequentially
automatically utilizes all available channels. In practice, interleaving is
done at a much more granular level than a full page.

Modern computing platforms support hardware acceleration for virtual to
physical translation inside of their Memory Management Unit (MMU). The MMU
often supports multiple different page sizes. On recent x86_64 systems, 4KiB,
2MiB, and 1GiB pages are supported. Typically, operating systems use 4KiB pages
by default.

NVMe devices transfer data to and from system memory using Direct Memory Access
(DMA). Specifically, they send messages across the PCI bus requesting data
transfers. In the absence of an IOMMU, these messages contain *physical* memory
addresses. These data transfers happen without involving the CPU, and the MMU
is responsible for making access to memory coherent.

NVMe devices may also place additional requirements on the physical layout of
memory for these transfers. The NVMe 1.0 specification requires all physical
memory to be describable by what is called a *PRP list*. To be described by a
PRP list, memory must have the following properties:

* The memory is broken into physical 4KiB pages, which we'll call device pages.
* The first device page can be a partial page starting at any 4-byte aligned
  address. It may extend up to the end of the current physical page, but not
  beyond.
* If there is more than one device page, the first device page must end on a
  physical 4KiB page boundary.
* The last device page begins on a physical 4KiB page boundary, but is not
  required to end on a physical 4KiB page boundary.

The specification allows for device pages to be other sizes than 4KiB, but all
known devices as of this writing use 4KiB.

The NVMe 1.1 specification added support for fully flexible scatter gather
lists, but the feature is optional and most devices available today do not
support it.

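As a rough illustration of these layout rules, the number of device pages a
buffer spans (and therefore the number of PRP entries needed to describe it)
follows directly from the buffer's offset within its first 4KiB page and its
length. The helper below is a hypothetical sketch for illustration only, not
part of the SPDK or NVMe driver API.

```c
#include <stdint.h>
#include <stdio.h>

#define DEVICE_PAGE_SIZE 4096ULL

/*
 * Count the 4KiB device pages spanned by a buffer, which equals the number
 * of PRP entries needed to describe it. `addr` is the buffer's starting
 * physical address and `len` its length in bytes.
 */
static uint64_t
prp_entries_needed(uint64_t addr, uint64_t len)
{
	uint64_t offset = addr & (DEVICE_PAGE_SIZE - 1);

	return (offset + len + DEVICE_PAGE_SIZE - 1) / DEVICE_PAGE_SIZE;
}

int
main(void)
{
	/* A 16KiB transfer starting 512 bytes into a page spans 5 device pages. */
	printf("%llu\n", (unsigned long long)prp_entries_needed(512, 16384));
	return 0;
}
```
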
User space drivers run in the context of a regular process and so have access
to virtual memory. In order to correctly program the device with physical
addresses, some method for address translation must be implemented.

The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from
within a process. This file contains the virtual address to physical address
mappings. As of Linux 4.0, accessing these mappings requires root privileges.
However, operating systems make absolutely no guarantee that the mapping of
virtual to physical pages is static. The operating system has no visibility
into whether a PCI device is directly transferring data to a set of physical
addresses, so great care must be taken to coordinate DMA requests with page
movement. When an operating system flags a page such that the virtual to
physical address mapping cannot be modified, this is called **pinning** the
page.

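For illustration, a deliberately naive sketch of such a lookup through
`/proc/self/pagemap` might look like the following. It must run as root on
recent kernels and, as discussed above, it does nothing to guarantee that the
returned physical address stays valid while a device uses it.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Translate a virtual address in this process to a physical address by
 * reading /proc/self/pagemap. Returns UINT64_MAX on failure.
 */
static uint64_t
virt_to_phys(void *vaddr)
{
	uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
	uint64_t virt = (uint64_t)(uintptr_t)vaddr;
	uint64_t entry;
	int fd;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		return UINT64_MAX;
	}

	/* Each virtual page has an 8-byte entry, indexed by virtual page number. */
	if (pread(fd, &entry, sizeof(entry),
		  (off_t)((virt / page_size) * sizeof(entry))) != sizeof(entry)) {
		close(fd);
		return UINT64_MAX;
	}
	close(fd);

	/* Bit 63: page present in RAM; bits 0-54: physical frame number. */
	if (!(entry & (1ULL << 63))) {
		return UINT64_MAX;
	}

	return (entry & ((1ULL << 55) - 1)) * page_size + (virt % page_size);
}
```
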
There are several reasons why the virtual to physical mappings may change, too.
By far the most common reason is due to page swapping to disk. However, the
operating system also moves pages during a process called compaction, which
collapses identical virtual pages onto the same physical page to save memory.
Some operating systems are also capable of doing transparent memory
compression. It is also increasingly possible to hot-add additional memory,
which may trigger a physical address rebalance to optimize interleaving.

POSIX provides the `mlock` call that forces a virtual page of memory to always
be backed by a physical page. In effect, this disables swapping. This does
*not* guarantee, however, that the virtual to physical address mapping is
static. The `mlock` call should not be confused with a **pin** call, and it
turns out that POSIX does not define an API for pinning memory. Therefore, the
mechanism to allocate pinned memory is operating system specific.

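A minimal sketch of `mlock` usage is shown below; even though the lock
succeeds, the kernel remains free to move the locked page to a different
physical address.

```c
#include <stdlib.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 4096;
	void *buf = aligned_alloc(4096, len);

	/* The page backing buf stays resident, but its physical address may still change. */
	if (buf == NULL || mlock(buf, len) != 0) {
		return 1;
	}

	munlock(buf, len);
	free(buf);
	return 0;
}
```
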
SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by
allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages
differently than regular 4KiB pages. Specifically, the operating system will
never change their physical location. This is not an intentional guarantee, so
it could change in future versions, but it is true today and has been for a
number of years (see the later section on the IOMMU for a future-proof
solution). DPDK goes to great lengths to allocate hugepages such that it can
string together the longest possible runs of physical pages, so that it can
accommodate physically contiguous allocations larger than a single page.

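DPDK manages its hugepages through hugetlbfs mount points, but as a rough
sketch of what a hugepage-backed mapping looks like at the system call level,
the following requests a single anonymous 2MiB hugepage directly from the
Linux kernel. It assumes hugepages have already been reserved (for example via
`/proc/sys/vm/nr_hugepages`) and is not how DPDK itself performs the
allocation.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_HUGETLB */
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 2 * 1024 * 1024; /* one 2MiB hugepage */

	/* Ask the kernel for an anonymous hugepage-backed mapping. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(buf, len);
	return 0;
}
```
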
With this explanation, hopefully it is now clear why all data buffers passed to
SPDK must be allocated using spdk_dma_malloc() or its siblings. The buffers
must be allocated specifically so that they are pinned and so that physical
addresses are known.

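A minimal sketch of such an allocation is shown below. It assumes an
application that links against the SPDK environment library and initializes it
before allocating; the environment options and error handling are kept to a
bare minimum for illustration.

```c
#include "spdk/env.h"

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	struct spdk_env_opts opts;
	uint64_t phys_addr;
	void *buf;

	/* Initialize the SPDK environment (DPDK underneath), which claims hugepages. */
	spdk_env_opts_init(&opts);
	opts.name = "memory_example";
	spdk_env_init(&opts);

	/* 4KiB buffer, 4KiB aligned, zeroed, pinned, with its physical address reported. */
	buf = spdk_dma_zmalloc(4096, 4096, &phys_addr);
	if (buf == NULL) {
		fprintf(stderr, "spdk_dma_zmalloc() failed\n");
		return 1;
	}

	printf("virtual %p physical 0x%" PRIx64 "\n", buf, phys_addr);

	spdk_dma_free(buf);
	return 0;
}
```
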
# IOMMU Support

Many platforms contain an extra piece of hardware called an I/O Memory
Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it
provides virtualized address spaces to peripheral devices (i.e. on the PCI
bus). The MMU knows about virtual to physical mappings per process on the
system, so the IOMMU associates a particular device with one of these mappings
and then allows the user to assign arbitrary *bus addresses* to virtual
addresses in their process. All DMA operations between the PCI device and
system memory are then translated through the IOMMU by converting the bus
address to a virtual address and then the virtual address to the physical
address. This allows the operating system to freely modify the virtual to
physical address mapping without breaking ongoing DMA operations. Linux
provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU
with their current process.

This is a future-proof, hardware-accelerated solution for performing DMA
operations into and out of a user space process and forms the long-term
foundation for SPDK and DPDK's memory management strategy. We highly recommend
that applications are deployed using vfio with the IOMMU enabled, which is
fully supported today.