doc: Programming guide for block device abstraction layer

Change-Id: Ib27462769e146a2b4b69302eac386255262081f6
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/397286
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Ben Walker 2018-01-03 13:33:45 +01:00 committed by Jim Harris
parent 5f947da76b
commit daf33a0921
4 changed files with 301 additions and 1 deletions


@ -789,9 +789,11 @@ INPUT = ../include/spdk \
getting_started.md \
memory.md \
porting.md \
bdev.md \
bdev_module.md \
bdev_pg.md \
blob.md \
blobfs.md \
bdev.md \
event.md \
ioat.md \
iscsi.md \

doc/bdev_module.md Normal file

@ -0,0 +1,150 @@
# Writing a Custom Block Device Module {#bdev_module}
## Target Audience
This programming guide is intended for developers authoring their own block
device modules to integrate with SPDK's bdev layer. For a guide on how to use
the bdev layer, see @ref bdev_pg.
## Introduction
A block device module is SPDK's equivalent of a device driver in a traditional
operating system. The module provides a set of function pointers that are
called to service block device I/O requests. SPDK provides a number of block
device modules including NVMe, RAM-disk, and Ceph RBD. However, some users
will want to write their own to interact with either custom hardware or an
existing storage software stack. This guide demonstrates how to write a
module.
## Creating A New Module
Block device modules are located in lib/bdev/<module_name> today. It is not
currently possible to place the code for a bdev module elsewhere, but updates
to the build system could be made to enable this in the future. To create a
module, add a new directory with a single C file and a Makefile. A great
starting point is to copy the existing 'null' bdev module.
The primary interface that bdev modules will interact with is in
include/spdk_internal/bdev.h. That header defines a macro,
SPDK_BDEV_MODULE_REGISTER, that declares a new bdev module. This macro takes as
arguments a number of function pointers that are used to initialize, tear
down, and get the configuration of the module. There are also arguments to
specify context size, which is scratch space that will be allocated in each
I/O request for use by this module, and a callback that will be called each
time a new bdev is registered by another module.
## Creating Bdevs
New bdevs are created within the module by calling spdk_bdev_register(). The
module must allocate a struct spdk_bdev, fill it out appropriately, and pass
it to the register call. The most important field to fill out is `fn_table`,
which points at this data structure:
~~~{.c}
/*
* Function table for a block device backend.
*
* The backend block device function table provides a set of APIs to allow
* communication with a backend. The main commands are read/write API
* calls for I/O via submit_request.
*/
struct spdk_bdev_fn_table {
/* Destroy the backend block device object */
int (*destruct)(void *ctx);
/* Process the IO. */
void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);
/* Check if the block device supports a specific I/O type. */
bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);
/* Get an I/O channel for the specific bdev for the calling thread. */
struct spdk_io_channel *(*get_io_channel)(void *ctx);
/*
* Output driver-specific configuration to a JSON stream. Optional - may be NULL.
*
* The JSON write context will be initialized with an open object, so the bdev
* driver should write a name (based on the driver name) followed by a JSON value
* (most likely another nested object).
*/
int (*dump_config_json)(void *ctx, struct spdk_json_write_ctx *w);
/* Get spin-time per I/O channel in microseconds.
* Optional - may be NULL.
*/
uint64_t (*get_spin_time)(struct spdk_io_channel *ch);
};
~~~
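The bdev module must implement these function callbacks, except for those
marked optional in the comments above (`dump_config_json` and
`get_spin_time`), which may be NULL.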
The `destruct` function is called to tear down the device when the system no
longer needs it. What `destruct` does is up to the module - it may just be
freeing memory or it may be shutting down a piece of hardware.
The `io_type_supported` function returns whether a particular I/O type is
supported. The available I/O types are:
~~~{.c}
/** bdev I/O type */
enum spdk_bdev_io_type {
SPDK_BDEV_IO_TYPE_INVALID = 0,
SPDK_BDEV_IO_TYPE_READ,
SPDK_BDEV_IO_TYPE_WRITE,
SPDK_BDEV_IO_TYPE_UNMAP,
SPDK_BDEV_IO_TYPE_FLUSH,
SPDK_BDEV_IO_TYPE_RESET,
SPDK_BDEV_IO_TYPE_NVME_ADMIN,
SPDK_BDEV_IO_TYPE_NVME_IO,
SPDK_BDEV_IO_TYPE_NVME_IO_MD,
SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};
~~~
For the simplest bdev modules, only `SPDK_BDEV_IO_TYPE_READ` and
`SPDK_BDEV_IO_TYPE_WRITE` are necessary. `SPDK_BDEV_IO_TYPE_UNMAP` is often
referred to as "trim" or "deallocate", and is a request to mark a set of
blocks as no longer containing valid data. `SPDK_BDEV_IO_TYPE_FLUSH` is a
request to make all previously completed writes durable. Many devices do not
require flushes. `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is just like a regular
write, but does not provide a data buffer (it would have just contained all
0's). If it isn't supported, the generic bdev code is capable of emulating it
by sending regular write requests.
`SPDK_BDEV_IO_TYPE_RESET` is a request to abort all I/O and return the
underlying device to its initial state. Do not complete the reset request
until all I/O has been completed in some way.
`SPDK_BDEV_IO_TYPE_NVME_ADMIN`, `SPDK_BDEV_IO_TYPE_NVME_IO`, and
`SPDK_BDEV_IO_TYPE_NVME_IO_MD` are all mechanisms for passing raw NVMe
commands through the SPDK bdev layer. They're strictly optional, and it
probably only makes sense to implement those if the backing storage device is
capable of handling NVMe commands.
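As a minimal sketch of an `io_type_supported` callback, the following reports support for reads and writes only, which is the simplest case described above. The function name `example_io_type_supported` is hypothetical, and the enum is copied locally from the listing above only so that the sketch compiles on its own; a real module would include the SPDK header instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the enum listed above, so the sketch is self-contained. */
enum spdk_bdev_io_type {
	SPDK_BDEV_IO_TYPE_INVALID = 0,
	SPDK_BDEV_IO_TYPE_READ,
	SPDK_BDEV_IO_TYPE_WRITE,
	SPDK_BDEV_IO_TYPE_UNMAP,
	SPDK_BDEV_IO_TYPE_FLUSH,
	SPDK_BDEV_IO_TYPE_RESET,
	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
	SPDK_BDEV_IO_TYPE_NVME_IO,
	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};

/*
 * Hypothetical callback for a very simple module: only plain reads and
 * writes are accepted. Everything else, including the NVMe passthrough
 * types, is reported as unsupported.
 */
static bool
example_io_type_supported(void *ctx, enum spdk_bdev_io_type io_type)
{
	(void)ctx; /* this sketch keeps no per-device state */

	switch (io_type) {
	case SPDK_BDEV_IO_TYPE_READ:
	case SPDK_BDEV_IO_TYPE_WRITE:
		return true;
	default:
		return false;
	}
}
```

Because `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is reported as unsupported here, such a module would rely on the emulation by regular writes mentioned above.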
The `get_io_channel` function should return an I/O channel. For a detailed
explanation of I/O channels, see @ref concurrency. The generic bdev layer will
call `get_io_channel` one time per thread, cache the result, and pass that
result to `submit_request`. It will use the corresponding channel for the
thread it calls `submit_request` on.
The `submit_request` function is called to actually submit I/O requests to the
block device. Once the I/O request is completed, the module must call
spdk_bdev_io_complete(). The I/O does not have to finish within the calling
context of `submit_request`.
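The point that an I/O need not finish inside `submit_request` can be illustrated with a toy model. Everything below (`toy_io`, `toy_channel`, `toy_submit_request`, `toy_poll`, `toy_io_complete`) is a hypothetical stand-in written for this sketch, not SPDK code; `toy_io_complete` only plays the role that spdk_bdev_io_complete() plays in a real module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the SPDK structures. */
struct toy_io {
	struct toy_io *next;
	bool completed;
	bool success;
};

struct toy_channel {
	struct toy_io *pending_head;
	struct toy_io *pending_tail;
};

/* Plays the role of spdk_bdev_io_complete(): each I/O completes exactly once. */
static void
toy_io_complete(struct toy_io *io, bool success)
{
	assert(!io->completed);
	io->completed = true;
	io->success = success;
}

/* Analogue of submit_request: queue the I/O and return immediately. */
static void
toy_submit_request(struct toy_channel *ch, struct toy_io *io)
{
	io->next = NULL;
	io->completed = false;
	io->success = false;

	if (ch->pending_tail != NULL) {
		ch->pending_tail->next = io;
	} else {
		ch->pending_head = io;
	}
	ch->pending_tail = io;
}

/* Called some time later; completes all queued I/O and returns the count. */
static int
toy_poll(struct toy_channel *ch)
{
	int count = 0;

	while (ch->pending_head != NULL) {
		struct toy_io *io = ch->pending_head;

		ch->pending_head = io->next;
		if (ch->pending_head == NULL) {
			ch->pending_tail = NULL;
		}
		toy_io_complete(io, true);
		count++;
	}
	return count;
}
```

In this model the submission path only records the request. The completion call happens from a different calling context, which is the pattern a module backed by asynchronous hardware would typically follow.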
## Creating Virtual Bdevs
Block devices are considered virtual if they handle I/O requests by routing
the I/O to other block devices. The canonical example would be a bdev module
that implements RAID. Virtual bdevs are created in the same way as regular
bdevs, but take one additional step. The module can look up the underlying
bdevs it wishes to route I/O to using spdk_bdev_get_by_name(), where the string
name is provided by the user in a configuration file or via an RPC. The module
may then proceed as normal by opening the bdev to obtain a descriptor, and
creating I/O channels for the bdev (probably in response to the
`get_io_channel` callback). The final step is to have the module use its open
descriptor to call spdk_bdev_module_claim_bdev(), indicating that it is
consuming the underlying bdev. This prevents other users from opening
descriptors with write permissions. This effectively 'promotes' the descriptor
to write-exclusive and is an operation only available to bdev modules.

doc/bdev_pg.md Normal file

@ -0,0 +1,146 @@
# Block Device Layer Programming Guide {#bdev_pg}
## Target Audience
This programming guide is intended for developers authoring applications that
use the SPDK bdev library to access block devices.
## Introduction
A block device is a storage device that supports reading and writing data in
fixed-size blocks. These blocks are usually 512 or 4096 bytes. The
devices may be logical constructs in software or correspond to physical
devices like NVMe SSDs.
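As a small worked example of block addressing, the sketch below converts a byte range into a block range and rejects ranges that are not aligned to the block size. The helper name `bytes_to_blocks` and its return convention are hypothetical and not part of the SPDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an SPDK function): convert a byte offset and
 * length into a starting block and a block count. Returns false if the
 * range is not aligned to the block size.
 */
static bool
bytes_to_blocks(uint32_t block_size, uint64_t offset_bytes, uint64_t num_bytes,
		uint64_t *offset_blocks, uint64_t *num_blocks)
{
	assert(block_size != 0);

	if (offset_bytes % block_size != 0 || num_bytes % block_size != 0) {
		return false;
	}

	*offset_blocks = offset_bytes / block_size;
	*num_blocks = num_bytes / block_size;
	return true;
}
```

With 512-byte blocks, a 1024-byte request at byte offset 4096 covers 2 blocks starting at block 8. The same byte offset of 512 on a device with 4096-byte blocks is rejected as unaligned.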
The block device layer consists of a single generic library in `lib/bdev`,
plus a number of optional modules (as separate libraries) that implement
various types of block devices. The public header file for the generic library
is bdev.h, which is the entirety of the API needed to interact with any type
of block device. This guide will cover how to interact with bdevs using that
API. For a guide to implementing a bdev module, see @ref bdev_module.
The bdev layer provides a number of useful features in addition to providing a
common abstraction for all block devices:
- Automatic queueing of I/O requests in response to queue full or out-of-memory conditions
- Hot remove support, even while I/O traffic is occurring
- I/O statistics such as bandwidth and latency
- Device reset support and I/O timeout tracking
## Basic Primitives
Users of the bdev API interact with a number of basic objects.
struct spdk_bdev, which this guide will refer to as a *bdev*, represents a
generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*,
represents a handle to a given block device. Descriptors are used to establish
and track permissions to use the underlying block device, much like a file
descriptor on UNIX systems. Requests to the block device are asynchronous and
represented by spdk_bdev_io objects. Requests must be submitted on an
associated I/O channel. The motivation and design of I/O channels is described
in @ref concurrency.
Bdevs can be layered, such that some bdevs service I/O by routing requests to
other bdevs. This can be used to implement caching, RAID, logical volume
management, and more. Bdevs that route I/O to other bdevs are often referred
to as virtual bdevs, or *vbdevs* for short.
## Initializing The Library
The bdev layer depends on the generic message passing infrastructure
abstracted by the header file include/spdk/io_channel.h. See @ref concurrency for a
full description. Most importantly, calls into the bdev library may only be
made from threads that have been allocated with SPDK by calling
spdk_allocate_thread().
From an allocated thread, the bdev library may be initialized by calling
spdk_bdev_initialize(), which is an asynchronous operation. Until the completion
callback is called, no other bdev library functions may be invoked. Similarly,
to tear down the bdev library, call spdk_bdev_finish().
## Discovering Block Devices
All block devices have a simple string name. At any time, a pointer to the
device object can be obtained by calling spdk_bdev_get_by_name(), or the entire
set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and
their variants.
Some block devices may also be given aliases, which are also string names.
Aliases behave like symlinks - they can be used interchangeably with the real
name to look up the block device.
## Preparing To Use A Block Device
In order to send I/O requests to a block device, it must first be opened by
calling spdk_bdev_open(). This will return a descriptor. Multiple users may have
a bdev open at the same time, and coordination of reads and writes between
users must be handled by some higher level mechanism outside of the bdev
layer. Opening a bdev with write permission may fail if a virtual bdev module
has *claimed* the bdev. Virtual bdev modules implement logic like RAID or
logical volume management and forward their I/O to lower level bdevs, so they
mark these lower level bdevs as claimed to prevent outside users from issuing
writes.
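The interaction between claims and write opens can be sketched with a toy model. The names `toy_bdev`, `toy_bdev_open`, and `toy_bdev_claim` are hypothetical, and returning `-EPERM` for a refused operation is an assumption made for this sketch rather than a statement about the library's error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bdev; this is NOT struct spdk_bdev. */
struct toy_bdev {
	const void *claimed_by;	/* module that claimed the bdev, or NULL */
	int open_count;		/* number of open descriptors */
};

/* Open the bdev. Assumption: a refused open is reported as -EPERM. */
static int
toy_bdev_open(struct toy_bdev *bdev, bool write)
{
	assert(bdev != NULL);

	if (write && bdev->claimed_by != NULL) {
		/* A module has claimed the bdev, so outside writers are refused. */
		return -EPERM;
	}
	bdev->open_count++;
	return 0;
}

/* Claim the bdev on behalf of a module; only one claim may exist. */
static int
toy_bdev_claim(struct toy_bdev *bdev, const void *module)
{
	assert(bdev != NULL && module != NULL);

	if (bdev->claimed_by != NULL) {
		return -EPERM;
	}
	bdev->claimed_by = module;
	return 0;
}
```

In this model a read-only open still succeeds after the claim, while a write open is refused, which matches the description above of claims preventing outside users from issuing writes.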
When a block device is opened, an optional callback and context can be
provided that will be called if the underlying storage servicing the block
device is removed. For example, the remove callback will be called on each
open descriptor for a bdev backed by a physical NVMe SSD when the NVMe SSD is
hot-unplugged. The callback can be thought of as a request to close the open
descriptor so other memory may be freed. A bdev cannot be torn down while open
descriptors exist, so it is highly recommended that a callback is provided.
When a user is done with a descriptor, they may release it by calling
spdk_bdev_close().
Descriptors may be passed to and used from multiple threads simultaneously.
However, for each thread a separate I/O channel must be obtained by calling
spdk_bdev_get_io_channel(). This will allocate the necessary per-thread
resources to submit I/O requests to the bdev without taking locks. To release
a channel, call spdk_put_io_channel(). A descriptor cannot be closed until
all associated channels have been destroyed.
## Sending I/O
Once a descriptor and a channel have been obtained, I/O may be sent by calling
the various I/O submission functions such as spdk_bdev_read(). These calls each
take a callback as an argument which will be called some time later with a
handle to an spdk_bdev_io object. In response to that completion, the user
must call spdk_bdev_free_io() to release the resources. Within this callback,
the user may also use the functions spdk_bdev_io_get_nvme_status() and
spdk_bdev_io_get_scsi_status() to obtain error information in the format of
their choosing.
I/O submission is performed by calling functions such as spdk_bdev_read() or
spdk_bdev_write(). These functions take as an argument a pointer to a region of
memory or a scatter gather list describing memory that will be transferred to
the block device. This memory must be allocated through spdk_dma_malloc() or
its variants. For a full explanation of why the memory must come from a
special allocation pool, see @ref memory. Where possible, data in memory will
be *directly transferred to the block device* using
[Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access).
That means it is not copied.
All I/O submission functions are asynchronous and non-blocking. They will not
block or stall the thread for any reason. However, the I/O submission
functions may fail in one of two ways. First, they may fail immediately and
return an error code. In that case, the provided callback will not be called.
Second, they may fail asynchronously. In that case, the associated
spdk_bdev_io will be passed to the callback and it will report error
information.
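The two failure modes can be illustrated with a toy model. All names below (`toy_dev`, `toy_read`, `toy_dev_poll`, `record_result`) are hypothetical and the error codes are assumptions for this sketch; none of this is the SPDK API. The property being shown is that an immediate failure never invokes the callback, while an asynchronous failure is reported through it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*toy_io_cb)(bool success, void *cb_arg);

struct toy_dev {
	uint64_t num_blocks;	/* device size in blocks */
	bool media_error;	/* when true, accepted I/O fails later */
	toy_io_cb pending_cb;	/* a single outstanding request, for brevity */
	void *pending_arg;
};

/* Returns 0 if the request was accepted, or a negative errno if not. */
static int
toy_read(struct toy_dev *dev, uint64_t offset_blocks, uint64_t num_blocks,
	 toy_io_cb cb, void *cb_arg)
{
	/* Immediate failure: invalid range. The callback is never called. */
	if (offset_blocks >= dev->num_blocks ||
	    num_blocks > dev->num_blocks - offset_blocks) {
		return -EINVAL;
	}
	/* Immediate failure: the only request slot is in use. */
	if (dev->pending_cb != NULL) {
		return -ENOMEM;
	}
	dev->pending_cb = cb;
	dev->pending_arg = cb_arg;
	return 0;
}

/* Runs later; reports success or asynchronous failure via the callback. */
static void
toy_dev_poll(struct toy_dev *dev)
{
	toy_io_cb cb = dev->pending_cb;

	if (cb == NULL) {
		return;
	}
	dev->pending_cb = NULL;
	cb(!dev->media_error, dev->pending_arg);
}

/* Example completion callback: record the outcome in an int. */
static void
record_result(bool success, void *cb_arg)
{
	*(int *)cb_arg = success ? 1 : -1;
}
```

A caller therefore needs two error paths: one that checks the return value of the submission call, and one inside the completion callback.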
Some I/O request types are optional and may not be supported by a given bdev.
To query a bdev for the I/O request types it supports, call
spdk_bdev_io_type_supported().
## Resetting A Block Device
In order to handle unexpected failure conditions, the bdev library provides a
mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass
a message to every other thread that has an I/O channel for the bdev, pause
those channels, then forward a reset request to the underlying bdev module and wait
for completion. Upon completion, the I/O channels will resume and the reset
will complete. The behavior of the reset inside the bdev module is
module-specific. For example, NVMe devices will delete all queue pairs,
perform an NVMe reset, then recreate the queue pairs and continue. Most
importantly, regardless of device type, *all I/O outstanding to the block
device will be completed prior to the reset completing.*
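
The reset sequence described above can be sketched as a toy model with a fixed number of channels. The names (`toy_rdev`, `toy_rchannel`, `toy_rsubmit`, `toy_reset_begin`, `toy_reset_finish`) are hypothetical, and refusing submissions on a paused channel is a simplification chosen for this sketch, not a description of what the bdev library does with I/O that arrives during a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHANNELS 4

struct toy_rchannel {
	bool paused;		/* true while a reset is in progress */
	int outstanding;	/* I/O submitted but not yet completed */
};

struct toy_rdev {
	struct toy_rchannel ch[TOY_MAX_CHANNELS];
	int aborted;		/* I/O completed with an error by a reset */
	int resets_done;
};

/* Submit one I/O on a channel; this toy refuses it while paused. */
static bool
toy_rsubmit(struct toy_rdev *dev, size_t idx)
{
	assert(idx < TOY_MAX_CHANNELS);

	if (dev->ch[idx].paused) {
		return false;
	}
	dev->ch[idx].outstanding++;
	return true;
}

/* Step 1: pause every channel that exists for the device. */
static void
toy_reset_begin(struct toy_rdev *dev)
{
	size_t i;

	for (i = 0; i < TOY_MAX_CHANNELS; i++) {
		dev->ch[i].paused = true;
	}
}

/* Steps 2 and 3: complete all outstanding I/O, resume, finish the reset. */
static void
toy_reset_finish(struct toy_rdev *dev)
{
	size_t i;

	for (i = 0; i < TOY_MAX_CHANNELS; i++) {
		dev->aborted += dev->ch[i].outstanding;
		dev->ch[i].outstanding = 0;
	}
	for (i = 0; i < TOY_MAX_CHANNELS; i++) {
		dev->ch[i].paused = false;
	}
	/* Only now, with nothing outstanding, does the reset complete. */
	dev->resets_done++;
}
```

The ordering inside `toy_reset_finish` mirrors the guarantee stated above: every outstanding I/O is accounted for before the reset itself is counted as complete.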


@ -25,6 +25,8 @@
# Programmer Guides {#general}
- @ref bdev_pg
- @ref bdev_module
- @ref directory_structure
- [Public API header files](files.html)