diff --git a/doc/Doxyfile b/doc/Doxyfile
index e322eebfb..c6cf9e6a5 100644
--- a/doc/Doxyfile
+++ b/doc/Doxyfile
@@ -789,9 +789,11 @@ INPUT = ../include/spdk \
                          getting_started.md \
                          memory.md \
                          porting.md \
+                         bdev.md \
+                         bdev_module.md \
+                         bdev_pg.md \
                          blob.md \
                          blobfs.md \
-                         bdev.md \
                          event.md \
                          ioat.md \
                          iscsi.md \
diff --git a/doc/bdev_module.md b/doc/bdev_module.md
new file mode 100644
index 000000000..f7ed93909
--- /dev/null
+++ b/doc/bdev_module.md
@@ -0,0 +1,150 @@
+# Writing a Custom Block Device Module {#bdev_module}
+
+## Target Audience
+
+This programming guide is intended for developers authoring their own block
+device modules to integrate with SPDK's bdev layer. For a guide on how to use
+the bdev layer, see @ref bdev_pg.
+
+## Introduction
+
+A block device module is SPDK's equivalent of a device driver in a traditional
+operating system. The module provides a set of function pointers that are
+called to service block device I/O requests. SPDK provides a number of block
+device modules including NVMe, RAM-disk, and Ceph RBD. However, some users
+will want to write their own to interact either with custom hardware or with
+an existing storage software stack. This guide is intended to demonstrate
+exactly how to write a module.
+
+## Creating A New Module
+
+Block device modules are located in lib/bdev/ today. It is not
+currently possible to place the code for a bdev module elsewhere, but updates
+to the build system could be made to enable this in the future. To create a
+module, add a new directory with a single C file and a Makefile. A great
+starting point is to copy the existing 'null' bdev module.
+
+The primary interface that bdev modules will interact with is in
+include/spdk_internal/bdev.h. That header defines a macro named
+SPDK_BDEV_MODULE_REGISTER that declares a new bdev module. This macro takes as
+arguments a number of function pointers that are used to initialize, tear
+down, and get the configuration of the module. There are also arguments to
+specify context size, which is scratch space that will be allocated in each
+I/O request for use by this module, and a callback that will be called each
+time a new bdev is registered by another module.
+
+## Creating Bdevs
+
+New bdevs are created within the module by calling spdk_bdev_register(). The
+module must allocate a struct spdk_bdev, fill it out appropriately, and pass
+it to the register call. The most important field to fill out is `fn_table`,
+which points at this data structure:
+
+~~~{.c}
+/*
+ * Function table for a block device backend.
+ *
+ * The backend block device function table provides a set of APIs to allow
+ * communication with a backend. The main commands are read/write API
+ * calls for I/O via submit_request.
+ */
+struct spdk_bdev_fn_table {
+	/* Destroy the backend block device object */
+	int (*destruct)(void *ctx);
+
+	/* Process the IO. */
+	void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);
+
+	/* Check if the block device supports a specific I/O type. */
+	bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);
+
+	/* Get an I/O channel for the specific bdev for the calling thread. */
+	struct spdk_io_channel *(*get_io_channel)(void *ctx);
+
+	/*
+	 * Output driver-specific configuration to a JSON stream. Optional - may be NULL.
+	 *
+	 * The JSON write context will be initialized with an open object, so the bdev
+	 * driver should write a name (based on the driver name) followed by a JSON value
+	 * (most likely another nested object).
+	 */
+	int (*dump_config_json)(void *ctx, struct spdk_json_write_ctx *w);
+
+	/* Get spin-time per I/O channel in microseconds.
+	 * Optional - may be NULL.
+	 */
+	uint64_t (*get_spin_time)(struct spdk_io_channel *ch);
+};
+~~~
+
+The bdev module must implement each callback that is not marked optional.
+
+The `destruct` function is called to tear down the device when the system no
+longer needs it. What `destruct` does is up to the module - it may just be
+freeing memory or it may be shutting down a piece of hardware.
+
+The `io_type_supported` function returns whether a particular I/O type is
+supported. The available I/O types are:
+
+~~~{.c}
+/** bdev I/O type */
+enum spdk_bdev_io_type {
+	SPDK_BDEV_IO_TYPE_INVALID = 0,
+	SPDK_BDEV_IO_TYPE_READ,
+	SPDK_BDEV_IO_TYPE_WRITE,
+	SPDK_BDEV_IO_TYPE_UNMAP,
+	SPDK_BDEV_IO_TYPE_FLUSH,
+	SPDK_BDEV_IO_TYPE_RESET,
+	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
+	SPDK_BDEV_IO_TYPE_NVME_IO,
+	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
+	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
+};
+~~~
+
+For the simplest bdev modules, only `SPDK_BDEV_IO_TYPE_READ` and
+`SPDK_BDEV_IO_TYPE_WRITE` are necessary. `SPDK_BDEV_IO_TYPE_UNMAP` is often
+referred to as "trim" or "deallocate", and is a request to mark a set of
+blocks as no longer containing valid data. `SPDK_BDEV_IO_TYPE_FLUSH` is a
+request to make all previously completed writes durable. Many devices do not
+require flushes. `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is just like a regular
+write, but does not provide a data buffer (it would have just contained all
+0's). If it isn't supported, the generic bdev code is capable of emulating it
+by sending regular write requests.
+
+`SPDK_BDEV_IO_TYPE_RESET` is a request to abort all I/O and return the
+underlying device to its initial state. Do not complete the reset request
+until all I/O has been completed in some way.
+
+`SPDK_BDEV_IO_TYPE_NVME_ADMIN`, `SPDK_BDEV_IO_TYPE_NVME_IO`, and
+`SPDK_BDEV_IO_TYPE_NVME_IO_MD` are all mechanisms for passing raw NVMe
+commands through the SPDK bdev layer. They're strictly optional, and it
+probably only makes sense to implement those if the backing storage device is
+capable of handling NVMe commands.
+
+The `get_io_channel` function should return an I/O channel. For a detailed
+explanation of I/O channels, see @ref concurrency. The generic bdev layer will
+call `get_io_channel` one time per thread, cache the result, and pass that
+result to `submit_request`. It will use the corresponding channel for the
+thread it calls `submit_request` on.
+
+The `submit_request` function is called to actually submit I/O requests to the
+block device. Once the I/O request is completed, the module must call
+spdk_bdev_io_complete(). The I/O does not have to finish within the calling
+context of `submit_request`.
+
+## Creating Virtual Bdevs
+
+Block devices are considered virtual if they handle I/O requests by routing
+the I/O to other block devices. The canonical example would be a bdev module
+that implements RAID. Virtual bdevs are created in the same way as regular
+bdevs, but take one additional step. The module can look up the underlying
+bdevs it wishes to route I/O to using spdk_bdev_get_by_name(), where the string
+name is provided by the user in a configuration file or via an RPC. The module
+then may proceed as normal by opening the bdev to obtain a descriptor, and
+creating I/O channels for the bdev (probably in response to the
+`get_io_channel` callback). The final step is to have the module use its open
+descriptor to call spdk_bdev_module_claim_bdev(), indicating that it is
+consuming the underlying bdev. This prevents other users from opening
+descriptors with write permissions. This effectively 'promotes' the descriptor
+to write-exclusive and is an operation only available to bdev modules.
diff --git a/doc/bdev_pg.md b/doc/bdev_pg.md
new file mode 100644
index 000000000..11f6ec529
--- /dev/null
+++ b/doc/bdev_pg.md
@@ -0,0 +1,146 @@
+# Block Device Layer Programming Guide {#bdev_pg}
+
+## Target Audience
+
+This programming guide is intended for developers authoring applications that
+use the SPDK bdev library to access block devices.
+
+## Introduction
+
+A block device is a storage device that supports reading and writing data in
+fixed-size blocks. These blocks are usually 512 or 4096 bytes. The
+devices may be logical constructs in software or correspond to physical
+devices like NVMe SSDs.
+
+The block device layer consists of a single generic library in `lib/bdev`,
+plus a number of optional modules (as separate libraries) that implement
+various types of block devices. The public header file for the generic library
+is bdev.h, which is the entirety of the API needed to interact with any type
+of block device. This guide will cover how to interact with bdevs using that
+API. For a guide to implementing a bdev module, see @ref bdev_module.
+
+The bdev layer provides a number of useful features in addition to providing a
+common abstraction for all block devices:
+
+- Automatic queueing of I/O requests in response to queue full or out-of-memory conditions
+- Hot remove support, even while I/O traffic is occurring
+- I/O statistics such as bandwidth and latency
+- Device reset support and I/O timeout tracking
+
+## Basic Primitives
+
+Users of the bdev API interact with a number of basic objects.
+
+struct spdk_bdev, which this guide will refer to as a *bdev*, represents a
+generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*,
+represents a handle to a given block device. Descriptors are used to establish
+and track permissions to use the underlying block device, much like a file
+descriptor on UNIX systems. Requests to the block device are asynchronous and
+represented by spdk_bdev_io objects. Requests must be submitted on an
+associated I/O channel. The motivation and design of I/O channels are described
+in @ref concurrency.
+
+Bdevs can be layered, such that some bdevs service I/O by routing requests to
+other bdevs. This can be used to implement caching, RAID, logical volume
+management, and more. Bdevs that route I/O to other bdevs are often referred
+to as virtual bdevs, or *vbdevs* for short.
+
+## Initializing The Library
+
+The bdev layer depends on the generic message passing infrastructure
+abstracted by the header file include/spdk/io_channel.h. See @ref concurrency
+for a full description. Most importantly, calls into the bdev library may only
+be made from threads that have been allocated with SPDK by calling
+spdk_allocate_thread().
+
+From an allocated thread, the bdev library may be initialized by calling
+spdk_bdev_initialize(), which is an asynchronous operation. Until the completion
+callback is called, no other bdev library functions may be invoked. Similarly,
+to tear down the bdev library, call spdk_bdev_finish().
+
+## Discovering Block Devices
+
+All block devices have a simple string name. At any time, a pointer to the
+device object can be obtained by calling spdk_bdev_get_by_name(), or the entire
+set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and
+their variants.
+
+Some block devices may also be given aliases, which are also string names.
+Aliases behave like symlinks - they can be used interchangeably with the real
+name to look up the block device.
+
+## Preparing To Use A Block Device
+
+In order to send I/O requests to a block device, it must first be opened by
+calling spdk_bdev_open(). This will return a descriptor. Multiple users may have
+a bdev open at the same time, and coordination of reads and writes between
+users must be handled by some higher level mechanism outside of the bdev
+layer. Opening a bdev with write permission may fail if a virtual bdev module
+has *claimed* the bdev. Virtual bdev modules implement logic like RAID or
+logical volume management and forward their I/O to lower level bdevs, so they
+mark these lower level bdevs as claimed to prevent outside users from issuing
+writes.
+
+When a block device is opened, an optional callback and context can be
+provided that will be called if the underlying storage servicing the block
+device is removed. For example, the remove callback will be called on each
+open descriptor for a bdev backed by a physical NVMe SSD when the NVMe SSD is
+hot-unplugged. The callback can be thought of as a request to close the open
+descriptor so other memory may be freed. A bdev cannot be torn down while open
+descriptors exist, so it is highly recommended that a callback is provided.
+
+When a user is done with a descriptor, they may release it by calling
+spdk_bdev_close().
+
+Descriptors may be passed to and used from multiple threads simultaneously.
+However, for each thread a separate I/O channel must be obtained by calling
+spdk_bdev_get_io_channel(). This will allocate the necessary per-thread
+resources to submit I/O requests to the bdev without taking locks. To release
+a channel, call spdk_put_io_channel(). A descriptor cannot be closed until
+all associated channels have been destroyed.
+
+## Sending I/O
+
+Once a descriptor and a channel have been obtained, I/O may be sent by calling
+the various I/O submission functions such as spdk_bdev_read(). These calls each
+take a callback as an argument which will be called some time later with a
+handle to an spdk_bdev_io object. In response to that completion, the user
+must call spdk_bdev_free_io() to release the resources. Within this callback,
+the user may also use the functions spdk_bdev_io_get_nvme_status() and
+spdk_bdev_io_get_scsi_status() to obtain error information in the format of
+their choosing.
+
+I/O submission is performed by calling functions such as spdk_bdev_read() or
+spdk_bdev_write(). These functions take as an argument a pointer to a region of
+memory or a scatter gather list describing memory that will be transferred to
+the block device. This memory must be allocated through spdk_dma_malloc() or
+its variants. For a full explanation of why the memory must come from a
+special allocation pool, see @ref memory. Where possible, data in memory will
+be *directly transferred to the block device* using
+[Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access).
+That means it is not copied.
+
+All I/O submission functions are asynchronous and non-blocking. They will not
+block or stall the thread for any reason. However, the I/O submission
+functions may fail in one of two ways. First, they may fail immediately and
+return an error code. In that case, the provided callback will not be called.
+Second, they may fail asynchronously. In that case, the associated
+spdk_bdev_io will be passed to the callback and it will report error
+information.
+
+Some I/O request types are optional and may not be supported by a given bdev.
+To query a bdev for the I/O request types it supports, call
+spdk_bdev_io_type_supported().
+
+## Resetting A Block Device
+
+In order to handle unexpected failure conditions, the bdev library provides a
+mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass
+a message to every other thread for which an I/O channel exists for the bdev,
+pause it, then forward a reset request to the underlying bdev module and wait
+for completion. Upon completion, the I/O channels will resume and the reset
+will complete. The specific behavior inside the bdev module is
+module-specific. For example, NVMe devices will delete all queue pairs,
+perform an NVMe reset, then recreate the queue pairs and continue. Most
+importantly, regardless of device type, *all I/O outstanding to the block
+device will be completed prior to the reset completing.*
diff --git a/doc/index.md b/doc/index.md
index 677bc52a8..8d9b27389 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -25,6 +25,8 @@
 
 # Programmer Guides {#general}
 
+- @ref bdev_pg
+- @ref bdev_module
 - @ref directory_structure
 - [Public API header files](files.html)