Add documentation explaining memory management
Change-Id: Ifce9507fc327ee090d4a825323df928d440fe025
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/362273
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
parent da0dfddb35
commit 4ee35e1f35

@@ -783,6 +783,7 @@ WARN_LOGFILE =
 INPUT = ../include/spdk \
         index.md \
         directory_structure.md \
+        memory.md \
         porting.md \
         blob.md \
         blobfs.md \

doc/index.md (12 lines changed)
@@ -5,15 +5,17 @@
 - [SPDK on GitHub](https://github.com/spdk/spdk/)
 - [SPDK.io](http://www.spdk.io/)

-The Storage Performance Development Kit (SPDK) provides a set of tools and libraries
-for writing high performance, scalable, user-mode storage applications.
-It achieves high performance by moving all of the necessary drivers
-into userspace and operating in a polled mode instead of relying on interrupts,
-which avoids kernel context switches and eliminates interrupt handling overhead.
+The Storage Performance Development Kit (SPDK) provides a set of tools and
+libraries for writing high performance, scalable, user-mode storage
+applications. It achieves high performance by moving all of the necessary
+drivers into userspace and operating in a polled mode instead of relying on
+interrupts, which avoids kernel context switches and eliminates interrupt
+handling overhead.

 ## General Information {#general}

 - @ref directory_structure
+- @ref memory
 - @ref porting
 - [Public API header files](files.html)

doc/memory.md (new file, 118 lines)

# Memory Management for User Space Drivers {#memory}

The following explains why all data buffers passed to SPDK must be allocated
using spdk_dma_malloc() or its siblings, and why SPDK relies on DPDK's proven
base functionality to implement memory management.

Computing platforms generally carve physical memory up into 4KiB segments
called pages. They number the pages from 0 to N starting from the beginning of
addressable memory. Operating systems then overlay 4KiB virtual memory pages on
top of these physical pages using arbitrarily complex mappings. See
[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview.

Physical memory is attached on channels, where each memory channel provides
some fixed amount of bandwidth. To optimize total memory bandwidth, the
physical addressing is often set up to automatically interleave between
channels. For instance, page 0 may be located on channel 0, page 1 on channel
1, page 2 on channel 2, etc. This is so that writing to memory sequentially
automatically utilizes all available channels. In practice, interleaving is
done at a much more granular level than a full page.

Modern computing platforms support hardware acceleration for virtual to
physical translation inside of their Memory Management Unit (MMU). The MMU
often supports multiple different page sizes. On recent x86_64 systems, 4KiB,
2MiB, and 1GiB pages are supported. Typically, operating systems use 4KiB pages
by default.

NVMe devices transfer data to and from system memory using Direct Memory Access
(DMA). Specifically, they send messages across the PCI bus requesting data
transfers. In the absence of an IOMMU, these messages contain *physical* memory
addresses. These data transfers happen without involving the CPU, and the MMU
is responsible for making access to memory coherent.

NVMe devices may also place additional requirements on the physical layout of
memory for these transfers. The NVMe 1.0 specification requires all physical
memory to be describable by what is called a *PRP list*. To be described by a
PRP list, memory must have the following properties:

* The memory is broken into physical 4KiB pages, which we'll call device pages.
* The first device page can be a partial page starting at any 4-byte aligned
  address. It may extend up to the end of the current physical page, but not
  beyond.
* If there is more than one device page, the first device page must end on a
  physical 4KiB page boundary.
* The last device page begins on a physical 4KiB page boundary, but is not
  required to end on a physical 4KiB page boundary.

The specification allows for device pages to be other sizes than 4KiB, but all
known devices as of this writing use 4KiB.

The NVMe 1.1 specification added support for fully flexible scatter gather
lists, but the feature is optional and most devices available today do not
support it.

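As a rough illustration of these layout rules, the number of device pages a
buffer spans (and therefore the number of PRP entries needed to describe it)
follows directly from the buffer's offset within its first 4KiB page and its
length. The helper below is a hypothetical sketch for illustration only, not
part of the SPDK or NVMe driver API.

```c
#include <stdint.h>
#include <stdio.h>

#define DEVICE_PAGE_SIZE 4096ULL

/*
 * Count the 4KiB device pages spanned by a buffer, which equals the number
 * of PRP entries needed to describe it. `addr` is the buffer's starting
 * physical address and `len` its length in bytes.
 */
static uint64_t
prp_entries_needed(uint64_t addr, uint64_t len)
{
	uint64_t offset = addr & (DEVICE_PAGE_SIZE - 1);

	return (offset + len + DEVICE_PAGE_SIZE - 1) / DEVICE_PAGE_SIZE;
}

int
main(void)
{
	/* A 16KiB transfer starting 512 bytes into a page spans 5 device pages. */
	printf("%llu\n", (unsigned long long)prp_entries_needed(512, 16384));
	return 0;
}
```
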
User space drivers run in the context of a regular process and so have access
to virtual memory. In order to correctly program the device with physical
addresses, some method for address translation must be implemented.

The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from
within a process. This file contains the virtual address to physical address
mappings. As of Linux 4.0, accessing these mappings requires root privileges.
However, operating systems make absolutely no guarantee that the mapping of
virtual to physical pages is static. The operating system has no visibility
into whether a PCI device is directly transferring data to a set of physical
addresses, so great care must be taken to coordinate DMA requests with page
movement. When an operating system flags a page such that the virtual to
physical address mapping cannot be modified, this is called **pinning** the
page.

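For illustration, a deliberately naive sketch of such a lookup through
`/proc/self/pagemap` might look like the following. It must run as root on
recent kernels and, as discussed above, it does nothing to guarantee that the
returned physical address stays valid while a device uses it.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Translate a virtual address in this process to a physical address by
 * reading /proc/self/pagemap. Returns UINT64_MAX on failure.
 */
static uint64_t
virt_to_phys(void *vaddr)
{
	uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
	uint64_t virt = (uint64_t)(uintptr_t)vaddr;
	uint64_t entry;
	int fd;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		return UINT64_MAX;
	}

	/* Each virtual page has an 8-byte entry, indexed by virtual page number. */
	if (pread(fd, &entry, sizeof(entry),
		  (off_t)((virt / page_size) * sizeof(entry))) != sizeof(entry)) {
		close(fd);
		return UINT64_MAX;
	}
	close(fd);

	/* Bit 63: page present in RAM; bits 0-54: physical frame number. */
	if (!(entry & (1ULL << 63))) {
		return UINT64_MAX;
	}

	return (entry & ((1ULL << 55) - 1)) * page_size + (virt % page_size);
}
```
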
There are several reasons why the virtual to physical mappings may change, too.
By far the most common reason is due to page swapping to disk. However, the
operating system also moves pages during a process called compaction, which
collapses identical virtual pages onto the same physical page to save memory.
Some operating systems are also capable of doing transparent memory
compression. It is also increasingly possible to hot-add additional memory,
which may trigger a physical address rebalance to optimize interleaving.

POSIX provides the `mlock` call that forces a virtual page of memory to always
be backed by a physical page. In effect, this disables swapping. This does
*not* guarantee, however, that the virtual to physical address mapping is
static. The `mlock` call should not be confused with a **pin** call, and it
turns out that POSIX does not define an API for pinning memory. Therefore, the
mechanism to allocate pinned memory is operating system specific.

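A minimal sketch of `mlock` usage is shown below; even though the lock
succeeds, the kernel remains free to move the locked page to a different
physical address.

```c
#include <stdlib.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 4096;
	void *buf = aligned_alloc(4096, len);

	/* The page backing buf stays resident, but its physical address may still change. */
	if (buf == NULL || mlock(buf, len) != 0) {
		return 1;
	}

	munlock(buf, len);
	free(buf);
	return 0;
}
```
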
SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by
allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages
differently than regular 4KiB pages. Specifically, the operating system will
never change their physical location. This is not an intentional guarantee, so
it could change in future versions, but it is true today and has been for a
number of years (see the later section on the IOMMU for a future-proof
solution). DPDK goes to great lengths to allocate hugepages such that it can
string together the longest possible runs of physical pages, so that it can
accommodate physically contiguous allocations larger than a single page.

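DPDK manages its hugepages through hugetlbfs mount points, but as a rough
sketch of what a hugepage-backed mapping looks like at the system call level,
the following requests a single anonymous 2MiB hugepage directly from the
Linux kernel. It assumes hugepages have already been reserved (for example via
`/proc/sys/vm/nr_hugepages`) and is not how DPDK itself performs the
allocation.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_HUGETLB */
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t len = 2 * 1024 * 1024; /* one 2MiB hugepage */

	/* Ask the kernel for an anonymous hugepage-backed mapping. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(buf, len);
	return 0;
}
```
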
With this explanation, hopefully it is now clear why all data buffers passed to
SPDK must be allocated using spdk_dma_malloc() or its siblings. The buffers
must be allocated specifically so that they are pinned and so that physical
addresses are known.

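A minimal sketch of such an allocation is shown below. It assumes an
application that links against the SPDK environment library and initializes it
before allocating; the environment options and error handling are kept to a
bare minimum for illustration.

```c
#include "spdk/env.h"

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	struct spdk_env_opts opts;
	uint64_t phys_addr;
	void *buf;

	/* Initialize the SPDK environment (DPDK underneath), which claims hugepages. */
	spdk_env_opts_init(&opts);
	opts.name = "memory_example";
	spdk_env_init(&opts);

	/* 4KiB buffer, 4KiB aligned, zeroed, pinned, with its physical address reported. */
	buf = spdk_dma_zmalloc(4096, 4096, &phys_addr);
	if (buf == NULL) {
		fprintf(stderr, "spdk_dma_zmalloc() failed\n");
		return 1;
	}

	printf("virtual %p physical 0x%" PRIx64 "\n", buf, phys_addr);

	spdk_dma_free(buf);
	return 0;
}
```
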
# IOMMU Support

Many platforms contain an extra piece of hardware called an I/O Memory
Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it
provides virtualized address spaces to peripheral devices (i.e. on the PCI
bus). The MMU knows about virtual to physical mappings per process on the
system, so the IOMMU associates a particular device with one of these mappings
and then allows the user to assign arbitrary *bus addresses* to virtual
addresses in their process. All DMA operations between the PCI device and
system memory are then translated through the IOMMU by converting the bus
address to a virtual address and then the virtual address to the physical
address. This allows the operating system to freely modify the virtual to
physical address mapping without breaking ongoing DMA operations. Linux
provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU
with their current process.

This is a future-proof, hardware-accelerated solution for performing DMA
operations into and out of a user space process and forms the long-term
foundation for SPDK and DPDK's memory management strategy. We highly recommend
that applications are deployed using vfio with the IOMMU enabled, which is
fully supported today.