doc: Programming guide for block device abstraction layer
Change-Id: Ib27462769e146a2b4b69302eac386255262081f6
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/397286
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
parent
5f947da76b
commit
daf33a0921
@ -789,9 +789,11 @@ INPUT = ../include/spdk \
|
||||
getting_started.md \
|
||||
memory.md \
|
||||
porting.md \
|
||||
bdev.md \
|
||||
bdev_module.md \
|
||||
bdev_pg.md \
|
||||
blob.md \
|
||||
blobfs.md \
|
||||
bdev.md \
|
||||
event.md \
|
||||
ioat.md \
|
||||
iscsi.md \
|
||||
|
150
doc/bdev_module.md
Normal file
@ -0,0 +1,150 @@
|
||||
# Writing a Custom Block Device Module {#bdev_module}
|
||||
|
||||
## Target Audience
|
||||
|
||||
This programming guide is intended for developers authoring their own block
|
||||
device modules to integrate with SPDK's bdev layer. For a guide on how to use
|
||||
the bdev layer, see @ref bdev_pg.
|
||||
|
||||
## Introduction
|
||||
|
||||
A block device module is SPDK's equivalent of a device driver in a traditional
|
||||
operating system. The module provides a set of function pointers that are
|
||||
called to service block device I/O requests. SPDK provides a number of block
|
||||
device modules including NVMe, RAM-disk, and Ceph RBD. However, some users
|
||||
will want to write their own to interact with either custom hardware or an
|
||||
existing storage software stack. This guide is intended to demonstrate exactly
|
||||
how to write a module.
|
||||
|
||||
## Creating A New Module
|
||||
|
||||
Block device modules are located in lib/bdev/<module_name> today. It is not
|
||||
currently possible to place the code for a bdev module elsewhere, but updates
|
||||
to the build system could be made to enable this in the future. To create a
|
||||
module, add a new directory with a single C file and a Makefile. A great
|
||||
starting point is to copy the existing 'null' bdev module.
|
||||
|
||||
The primary interface that bdev modules will interact with is in
|
||||
include/spdk_internal/bdev.h. In that header a macro is defined that declares
|
||||
a new bdev module named SPDK_BDEV_MODULE_REGISTER. This macro takes as
|
||||
arguments a number of function pointers that are used to initialize, tear
|
||||
down, and get the configuration of the module. There are also arguments to
|
||||
specify context size, which is scratch space that will be allocated in each
|
||||
I/O request for use by this module, and a callback that will be called each
|
||||
time a new bdev is registered by another module.
|
||||
|
||||
## Creating Bdevs
|
||||
|
||||
New bdevs are created within the module by calling spdk_bdev_register(). The
|
||||
module must allocate a struct spdk_bdev, fill it out appropriately, and pass
|
||||
it to the register call. The most important field to fill out is `fn_table`,
|
||||
which points at this data structure:
|
||||
|
||||
~~~{.c}
|
||||
/*
|
||||
* Function table for a block device backend.
|
||||
*
|
||||
* The backend block device function table provides a set of APIs to allow
|
||||
* communication with a backend. The main commands are read/write API
|
||||
* calls for I/O via submit_request.
|
||||
*/
|
||||
struct spdk_bdev_fn_table {
|
||||
/* Destroy the backend block device object */
|
||||
int (*destruct)(void *ctx);
|
||||
|
||||
/* Process the IO. */
|
||||
void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);
|
||||
|
||||
/* Check if the block device supports a specific I/O type. */
|
||||
bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);
|
||||
|
||||
/* Get an I/O channel for the specific bdev for the calling thread. */
|
||||
struct spdk_io_channel *(*get_io_channel)(void *ctx);
|
||||
|
||||
/*
|
||||
* Output driver-specific configuration to a JSON stream. Optional - may be NULL.
|
||||
*
|
||||
* The JSON write context will be initialized with an open object, so the bdev
|
||||
* driver should write a name (based on the driver name) followed by a JSON value
|
||||
* (most likely another nested object).
|
||||
*/
|
||||
int (*dump_config_json)(void *ctx, struct spdk_json_write_ctx *w);
|
||||
|
||||
/* Get spin-time per I/O channel in microseconds.
|
||||
* Optional - may be NULL.
|
||||
*/
|
||||
uint64_t (*get_spin_time)(struct spdk_io_channel *ch);
|
||||
};
|
||||
~~~
|
||||
|
||||
The bdev module must implement these function callbacks, except for those marked optional in the comments above.
|
||||
|
||||
The `destruct` function is called to tear down the device when the system no
|
||||
longer needs it. What `destruct` does is up to the module - it may just be
|
||||
freeing memory or it may be shutting down a piece of hardware.
|
||||
|
||||
The `io_type_supported` function returns whether a particular I/O type is
|
||||
supported. The available I/O types are:
|
||||
|
||||
~~~{.c}
|
||||
/** bdev I/O type */
|
||||
enum spdk_bdev_io_type {
|
||||
SPDK_BDEV_IO_TYPE_INVALID = 0,
|
||||
SPDK_BDEV_IO_TYPE_READ,
|
||||
SPDK_BDEV_IO_TYPE_WRITE,
|
||||
SPDK_BDEV_IO_TYPE_UNMAP,
|
||||
SPDK_BDEV_IO_TYPE_FLUSH,
|
||||
SPDK_BDEV_IO_TYPE_RESET,
|
||||
SPDK_BDEV_IO_TYPE_NVME_ADMIN,
|
||||
SPDK_BDEV_IO_TYPE_NVME_IO,
|
||||
SPDK_BDEV_IO_TYPE_NVME_IO_MD,
|
||||
SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
|
||||
};
|
||||
~~~
|
||||
|
||||
For the simplest bdev modules, only `SPDK_BDEV_IO_TYPE_READ` and
|
||||
`SPDK_BDEV_IO_TYPE_WRITE` are necessary. `SPDK_BDEV_IO_TYPE_UNMAP` is often
|
||||
referred to as "trim" or "deallocate", and is a request to mark a set of
|
||||
blocks as no longer containing valid data. `SPDK_BDEV_IO_TYPE_FLUSH` is a
|
||||
request to make all previously completed writes durable. Many devices do not
|
||||
require flushes. `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is just like a regular
|
||||
write, but does not provide a data buffer (it would have just contained all
|
||||
0's). If it isn't supported, the generic bdev code is capable of emulating it
|
||||
by sending regular write requests.
|
||||
|
||||
`SPDK_BDEV_IO_TYPE_RESET` is a request to abort all I/O and return the
|
||||
underlying device to its initial state. Do not complete the reset request
|
||||
until all I/O has been completed in some way.
|
||||
|
||||
`SPDK_BDEV_IO_TYPE_NVME_ADMIN`, `SPDK_BDEV_IO_TYPE_NVME_IO`, and
|
||||
`SPDK_BDEV_IO_TYPE_NVME_IO_MD` are all mechanisms for passing raw NVMe
|
||||
commands through the SPDK bdev layer. They're strictly optional, and it
|
||||
probably only makes sense to implement those if the backing storage device is
|
||||
capable of handling NVMe commands.
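As a rough illustration of the I/O type discussion above, here is a minimal sketch of an `io_type_supported` callback for a module that handles only reads and writes. The enum is a local copy of the one shown earlier so that the snippet compiles without SPDK headers; when building against SPDK, drop the copy and include the SPDK headers instead. The name `example_io_type_supported` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the enum shown earlier, only so this sketch builds standalone. */
enum spdk_bdev_io_type {
	SPDK_BDEV_IO_TYPE_INVALID = 0,
	SPDK_BDEV_IO_TYPE_READ,
	SPDK_BDEV_IO_TYPE_WRITE,
	SPDK_BDEV_IO_TYPE_UNMAP,
	SPDK_BDEV_IO_TYPE_FLUSH,
	SPDK_BDEV_IO_TYPE_RESET,
	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
	SPDK_BDEV_IO_TYPE_NVME_IO,
	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};

/* Hypothetical callback: report support for reads and writes only. */
static bool
example_io_type_supported(void *ctx, enum spdk_bdev_io_type io_type)
{
	(void)ctx; /* this sketch keeps no per-device state */

	switch (io_type) {
	case SPDK_BDEV_IO_TYPE_READ:
	case SPDK_BDEV_IO_TYPE_WRITE:
		return true;
	default:
		/* Unmap, flush, reset, NVMe passthrough and write zeroes are declined. */
		return false;
	}
}
```

Declining `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` here relies on the emulation by the generic bdev code that the text describes.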
|
||||
|
||||
The `get_io_channel` function should return an I/O channel. For a detailed
|
||||
explanation of I/O channels, see @ref concurrency. The generic bdev layer will
|
||||
call `get_io_channel` one time per thread, cache the result, and pass that
|
||||
result to `submit_request`. It will use the corresponding channel for the
|
||||
thread it calls `submit_request` on.
|
||||
|
||||
The `submit_request` function is called to actually submit I/O requests to the
|
||||
block device. Once the I/O request is completed, the module must call
|
||||
spdk_bdev_io_complete(). The I/O does not have to finish within the calling
|
||||
context of `submit_request`.
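The deferred completion described above can be modeled with a small standalone sketch. Every type and function here (`example_io`, `example_channel`, `example_io_complete`, `example_poll`) is a hypothetical stand-in and not part of SPDK; in a real module the completion step would be a call to spdk_bdev_io_complete(). The sketch assumes the module queues requests on a per-channel list and finishes them later from some polling context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum example_status { EXAMPLE_PENDING, EXAMPLE_SUCCESS, EXAMPLE_FAILED };

/* Stand-in for an I/O request. */
struct example_io {
	bool is_supported;           /* stand-in for an I/O type check */
	enum example_status status;
	struct example_io *next;
};

/* Stand-in for per-thread channel state: a list of queued requests. */
struct example_channel {
	struct example_io *head;
};

/* Stand-in for the completion call a real module would make. */
static void
example_io_complete(struct example_io *io, enum example_status status)
{
	io->status = status;
}

/* Queue supported requests; fail unsupported ones right away. */
static void
example_submit_request(struct example_channel *ch, struct example_io *io)
{
	assert(ch != NULL && io != NULL);

	if (!io->is_supported) {
		example_io_complete(io, EXAMPLE_FAILED);
		return;
	}
	io->status = EXAMPLE_PENDING;
	io->next = ch->head;
	ch->head = io;
}

/* Runs later, outside submit_request; returns how many requests it completed. */
static int
example_poll(struct example_channel *ch)
{
	int completed = 0;

	while (ch->head != NULL) {
		struct example_io *io = ch->head;

		ch->head = io->next;
		io->next = NULL;
		example_io_complete(io, EXAMPLE_SUCCESS);
		completed++;
	}
	return completed;
}
```

The point of the split is that `example_submit_request` returns while the request is still pending, and the completion happens on a later call.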
|
||||
|
||||
## Creating Virtual Bdevs
|
||||
|
||||
Block devices are considered virtual if they handle I/O requests by routing
|
||||
the I/O to other block devices. The canonical example would be a bdev module
|
||||
that implements RAID. Virtual bdevs are created in the same way as regular
|
||||
bdevs, but take one additional step. The module can look up the underlying
|
||||
bdevs it wishes to route I/O to using spdk_bdev_get_by_name(), where the string
|
||||
name is provided by the user in a configuration file or via an RPC. The module
|
||||
may then proceed as normal by opening the bdev to obtain a descriptor, and
|
||||
creating I/O channels for the bdev (probably in response to the
|
||||
`get_io_channel` callback). The final step is to have the module use its open
|
||||
descriptor to call spdk_bdev_module_claim_bdev(), indicating that it is
|
||||
consuming the underlying bdev. This prevents other users from opening
|
||||
descriptors with write permissions. This effectively 'promotes' the descriptor
|
||||
to write-exclusive and is an operation only available to bdev modules.
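The effect of a claim on later opens can be illustrated with a toy model. This is not SPDK code: `toy_bdev`, `toy_claim`, and `toy_open` are hypothetical names, the `-EPERM` return value is an assumption made for the sketch, and the model skips the descriptor that the claiming module itself holds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy bdev: records which module, if any, has claimed it. */
struct toy_bdev {
	const char *name;
	const void *claim_module; /* NULL while unclaimed */
};

/* A module claims the bdev; a second claim is refused. */
static int
toy_claim(struct toy_bdev *bdev, const void *module)
{
	assert(bdev != NULL && module != NULL);

	if (bdev->claim_module != NULL) {
		return -EPERM;
	}
	bdev->claim_module = module;
	return 0;
}

/* An outside user opens the bdev; write access is refused once it is claimed. */
static int
toy_open(const struct toy_bdev *bdev, bool write)
{
	assert(bdev != NULL);

	if (write && bdev->claim_module != NULL) {
		return -EPERM;
	}
	return 0;
}
```

Read-only opens still succeed after the claim, which matches the statement that only descriptors with write permissions are blocked.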
|
146
doc/bdev_pg.md
Normal file
@ -0,0 +1,146 @@
|
||||
# Block Device Layer Programming Guide {#bdev_pg}
|
||||
|
||||
## Target Audience
|
||||
|
||||
This programming guide is intended for developers authoring applications that
|
||||
use the SPDK bdev library to access block devices.
|
||||
|
||||
## Introduction
|
||||
|
||||
A block device is a storage device that supports reading and writing data in
|
||||
fixed-size blocks. These blocks are usually 512 or 4096 bytes. The
|
||||
devices may be logical constructs in software or correspond to physical
|
||||
devices like NVMe SSDs.
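To make the fixed-size block idea concrete, the following sketch converts byte offsets into block indexes and checks that a byte range covers whole blocks only. Both helper names are hypothetical and are not part of the bdev API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a byte range covers whole blocks only. */
static bool
range_is_block_aligned(uint64_t offset, uint64_t nbytes, uint32_t block_size)
{
	return block_size != 0 &&
	       offset % block_size == 0 &&
	       nbytes % block_size == 0;
}

/* Hypothetical helper: index of the block that contains a byte offset. */
static uint64_t
offset_to_block(uint64_t offset, uint32_t block_size)
{
	assert(block_size != 0);
	return offset / block_size;
}
```

For example, byte offset 8192 falls in block 16 on a device with 512-byte blocks and in block 2 on a device with 4096-byte blocks.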
|
||||
|
||||
The block device layer consists of a single generic library in `lib/bdev`,
|
||||
plus a number of optional modules (as separate libraries) that implement
|
||||
various types of block devices. The public header file for the generic library
|
||||
is bdev.h, which is the entirety of the API needed to interact with any type
|
||||
of block device. This guide will cover how to interact with bdevs using that
|
||||
API. For a guide to implementing a bdev module, see @ref bdev_module.
|
||||
|
||||
The bdev layer provides a number of useful features in addition to providing a
|
||||
common abstraction for all block devices:
|
||||
|
||||
- Automatic queueing of I/O requests in response to queue full or out-of-memory conditions
|
||||
- Hot remove support, even while I/O traffic is occurring
|
||||
- I/O statistics such as bandwidth and latency
|
||||
- Device reset support and I/O timeout tracking
|
||||
|
||||
## Basic Primitives
|
||||
|
||||
Users of the bdev API interact with a number of basic objects.
|
||||
|
||||
struct spdk_bdev, which this guide will refer to as a *bdev*, represents a
|
||||
generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*,
|
||||
represents a handle to a given block device. Descriptors are used to establish
|
||||
and track permissions to use the underlying block device, much like a file
|
||||
descriptor on UNIX systems. Requests to the block device are asynchronous and
|
||||
represented by spdk_bdev_io objects. Requests must be submitted on an
|
||||
associated I/O channel. The motivation and design of I/O channels are described
|
||||
in @ref concurrency.
|
||||
|
||||
Bdevs can be layered, such that some bdevs service I/O by routing requests to
|
||||
other bdevs. This can be used to implement caching, RAID, logical volume
|
||||
management, and more. Bdevs that route I/O to other bdevs are often referred
|
||||
to as virtual bdevs, or *vbdevs* for short.
|
||||
|
||||
## Initializing The Library
|
||||
|
||||
The bdev layer depends on the generic message passing infrastructure
|
||||
abstracted by the header file include/spdk/io_channel.h. See @ref concurrency for a
|
||||
full description. Most importantly, calls into the bdev library may only be
|
||||
made from threads that have been allocated with SPDK by calling
|
||||
spdk_allocate_thread().
|
||||
|
||||
From an allocated thread, the bdev library may be initialized by calling
|
||||
spdk_bdev_initialize(), which is an asynchronous operation. Until the completion
|
||||
callback is called, no other bdev library functions may be invoked. Similarly,
|
||||
to tear down the bdev library, call spdk_bdev_finish().
|
||||
|
||||
## Discovering Block Devices
|
||||
|
||||
All block devices have a simple string name. At any time, a pointer to the
|
||||
device object can be obtained by calling spdk_bdev_get_by_name(), or the entire
|
||||
set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and
|
||||
their variants.
|
||||
|
||||
Some block devices may also be given aliases, which are also string names.
|
||||
Aliases behave like symlinks - they can be used interchangeably with the real
|
||||
name to look up the block device.
|
||||
|
||||
## Preparing To Use A Block Device
|
||||
|
||||
In order to send I/O requests to a block device, it must first be opened by
|
||||
calling spdk_bdev_open(). This will return a descriptor. Multiple users may have
|
||||
a bdev open at the same time, and coordination of reads and writes between
|
||||
users must be handled by some higher level mechanism outside of the bdev
|
||||
layer. Opening a bdev with write permission may fail if a virtual bdev module
|
||||
has *claimed* the bdev. Virtual bdev modules implement logic like RAID or
|
||||
logical volume management and forward their I/O to lower level bdevs, so they
|
||||
mark these lower level bdevs as claimed to prevent outside users from issuing
|
||||
writes.
|
||||
|
||||
When a block device is opened, an optional callback and context can be
|
||||
provided that will be called if the underlying storage servicing the block
|
||||
device is removed. For example, the remove callback will be called on each
|
||||
open descriptor for a bdev backed by a physical NVMe SSD when the NVMe SSD is
|
||||
hot-unplugged. The callback can be thought of as a request to close the open
|
||||
descriptor so other memory may be freed. A bdev cannot be torn down while open
|
||||
descriptors exist, so it is highly recommended that a callback is provided.
|
||||
|
||||
When a user is done with a descriptor, they may release it by calling
|
||||
spdk_bdev_close().
|
||||
|
||||
Descriptors may be passed to and used from multiple threads simultaneously.
|
||||
However, for each thread a separate I/O channel must be obtained by calling
|
||||
spdk_bdev_get_io_channel(). This will allocate the necessary per-thread
|
||||
resources to submit I/O requests to the bdev without taking locks. To release
|
||||
a channel, call spdk_put_io_channel(). A descriptor cannot be closed until
|
||||
all associated channels have been destroyed.
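The ordering rule between channels and descriptors can be pictured with a small counting model. The names below (`toy_desc`, `toy_get_channel`, `toy_put_channel`, `toy_close`) are hypothetical and only model the rule. In particular, the refusal returned by `toy_close` is an illustration device, and this sketch makes no claim about how the real library reacts if the rule is violated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy descriptor that counts the channels obtained through it. */
struct toy_desc {
	int open_channels;
	bool closed;
};

/* Models one thread obtaining its channel. */
static void
toy_get_channel(struct toy_desc *desc)
{
	assert(!desc->closed);
	desc->open_channels++;
}

/* Models one thread releasing its channel. */
static void
toy_put_channel(struct toy_desc *desc)
{
	assert(desc->open_channels > 0);
	desc->open_channels--;
}

/* Models closing the descriptor; only valid once no channels remain. */
static bool
toy_close(struct toy_desc *desc)
{
	if (desc->open_channels > 0) {
		return false;
	}
	desc->closed = true;
	return true;
}
```

In application code the same ordering applies: every thread releases its channel first, and only then is the descriptor closed.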
|
||||
|
||||
## Sending I/O
|
||||
|
||||
Once a descriptor and a channel have been obtained, I/O may be sent by calling
|
||||
the various I/O submission functions such as spdk_bdev_read(). These calls each
|
||||
take a callback as an argument which will be called some time later with a
|
||||
handle to an spdk_bdev_io object. In response to that completion, the user
|
||||
must call spdk_bdev_free_io() to release the resources. Within this callback,
|
||||
the user may also use the functions spdk_bdev_io_get_nvme_status() and
|
||||
spdk_bdev_io_get_scsi_status() to obtain error information in the format of
|
||||
their choosing.
|
||||
|
||||
I/O submission is performed by calling functions such as spdk_bdev_read() or
|
||||
spdk_bdev_write(). These functions take as an argument a pointer to a region of
|
||||
memory or a scatter gather list describing memory that will be transferred to
|
||||
the block device. This memory must be allocated through spdk_dma_malloc() or
|
||||
its variants. For a full explanation of why the memory must come from a
|
||||
special allocation pool, see @ref memory. Where possible, data in memory will
|
||||
be *directly transferred to the block device* using
|
||||
[Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access).
|
||||
That means it is not copied.
|
||||
|
||||
All I/O submission functions are asynchronous and non-blocking. They will not
|
||||
block or stall the thread for any reason. However, the I/O submission
|
||||
functions may fail in one of two ways. First, they may fail immediately and
|
||||
return an error code. In that case, the provided callback will not be called.
|
||||
Second, they may fail asynchronously. In that case, the associated
|
||||
spdk_bdev_io will be passed to the callback and it will report error
|
||||
information.
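The two failure modes can be told apart with a pattern like the one below. It is a standalone model, not SPDK code: `toy_submit`, `toy_complete`, `toy_request`, and the error codes chosen are all assumptions made for illustration. The property that carries over is that an immediate error means the callback never runs, while an asynchronous error arrives through the callback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 512u

typedef void (*toy_io_cb)(bool success, void *cb_arg);

/* One request slot; a stand-in for the library's internal request object. */
struct toy_request {
	toy_io_cb cb;
	void *cb_arg;
	bool will_succeed;
	bool in_use;
};

/* Returns a negative errno on immediate failure; cb is NOT called in that case. */
static int
toy_submit(struct toy_request *slot, uint64_t offset, bool device_healthy,
	   toy_io_cb cb, void *cb_arg)
{
	assert(slot != NULL && cb != NULL);

	if (offset % TOY_BLOCK_SIZE != 0) {
		return -EINVAL; /* rejected before anything was queued */
	}
	if (slot->in_use) {
		return -ENOMEM; /* no free request object */
	}
	slot->cb = cb;
	slot->cb_arg = cb_arg;
	slot->will_succeed = device_healthy;
	slot->in_use = true;
	return 0;
}

/* Runs later; an asynchronous failure is reported through the callback. */
static void
toy_complete(struct toy_request *slot)
{
	assert(slot->in_use);
	slot->in_use = false;
	slot->cb(slot->will_succeed, slot->cb_arg);
}

struct toy_stats {
	int ok;
	int failed;
};

/* Example completion callback that tallies outcomes. */
static void
toy_count_cb(bool success, void *cb_arg)
{
	struct toy_stats *stats = cb_arg;

	if (success) {
		stats->ok++;
	} else {
		stats->failed++;
	}
}
```

A caller therefore needs two error paths: one on the return value of the submission call, and one inside the completion callback.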
|
||||
|
||||
Some I/O request types are optional and may not be supported by a given bdev.
|
||||
To query a bdev for the I/O request types it supports, call
|
||||
spdk_bdev_io_type_supported().
|
||||
|
||||
## Resetting A Block Device
|
||||
|
||||
In order to handle unexpected failure conditions, the bdev library provides a
|
||||
mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass
|
||||
a message to every other thread for which an I/O channel exists for the bdev,
|
||||
pause those channels, then forward a reset request to the underlying bdev module and wait
|
||||
for completion. Upon completion, the I/O channels will resume and the reset
|
||||
will complete. The specific behavior inside the bdev module is
|
||||
module-specific. For example, NVMe devices will delete all queue pairs,
|
||||
perform an NVMe reset, then recreate the queue pairs and continue. Most
|
||||
importantly, regardless of device type, *all I/O outstanding to the block
|
||||
device will be completed prior to the reset completing.*
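The ordering guarantee stated above can be expressed as an invariant in a toy model. Nothing here is SPDK code: `toy_dev`, `toy_reset`, and the fixed-size request table are hypothetical. The sketch assumes outstanding requests are finished by aborting them, which is only one of the ways a module might complete them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_IO 4

enum toy_state { TOY_IDLE, TOY_PENDING, TOY_DONE, TOY_ABORTED };

/* Toy device: a small table of request states plus a reset flag. */
struct toy_dev {
	enum toy_state io[TOY_MAX_IO];
	bool reset_complete;
};

/* Count requests that were submitted but have not completed yet. */
static size_t
toy_outstanding(const struct toy_dev *dev)
{
	size_t i, count = 0;

	for (i = 0; i < TOY_MAX_IO; i++) {
		if (dev->io[i] == TOY_PENDING) {
			count++;
		}
	}
	return count;
}

/* Finish every outstanding request first, then report the reset as complete. */
static void
toy_reset(struct toy_dev *dev)
{
	size_t i;

	dev->reset_complete = false;
	for (i = 0; i < TOY_MAX_IO; i++) {
		if (dev->io[i] == TOY_PENDING) {
			dev->io[i] = TOY_ABORTED;
		}
	}
	/* Invariant: nothing is outstanding when the reset completes. */
	assert(toy_outstanding(dev) == 0);
	dev->reset_complete = true;
}
```

Requests that had already finished keep their original result; only the ones still in flight are affected by the reset.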
|
@ -25,6 +25,8 @@
|
||||
|
||||
# Programmer Guides {#general}
|
||||
|
||||
- @ref bdev_pg
|
||||
- @ref bdev_module
|
||||
- @ref directory_structure
|
||||
- [Public API header files](files.html)
|
||||
|
||||
|