doc: Programmer's guide template and example for blobstore
Includes the proposed Blobstore Programmer's Guide that is also an example for other programming guides as well as a template that can be used to create future programming guides. Also removes the previous blob.md Change-Id: Iefbb58b8c3ab015bf8e0cd02ba2fbe6a86c1852c Signed-off-by: Paul Luse <paul.e.luse@intel.com> Reviewed-on: https://review.gerrithub.io/384118 Tested-by: SPDK Automated Test System <sys_sgsw@intel.com> Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
This commit is contained in:
parent
315630a2fd
commit
da58800f21
503
doc/blob.md
@@ -1,164 +1,257 @@
|
||||
# Blobstore {#blob}
|
||||
# Blobstore Programmer's Guide {#blob}
|
||||
|
||||
## Introduction
|
||||
# In this document {#blob_pg_toc}
|
||||
|
||||
The blobstore is a persistent, power-fail safe block allocator designed to be
|
||||
used as the local storage system backing a higher level storage service,
|
||||
typically in lieu of a traditional filesystem. These higher level services can
|
||||
be local databases or key/value stores (MySQL, RocksDB), they can be dedicated
|
||||
appliances (SAN, NAS), or distributed storage systems (ex. Ceph, Cassandra). It
|
||||
is not designed to be a general purpose filesystem, however, and it is
|
||||
intentionally not POSIX compliant. To avoid confusion, no reference to files or
|
||||
objects will be made at all, instead using the term 'blob'. The blobstore is
|
||||
designed to allow asynchronous, uncached, parallel reads and writes to groups
|
||||
of blocks on a block device called 'blobs'. Blobs are typically large,
|
||||
measured in at least hundreds of kilobytes, and are always a multiple of the
|
||||
underlying block size.
|
||||
* @ref blob_pg_audience
|
||||
* @ref blob_pg_intro
|
||||
* @ref blob_pg_theory
|
||||
* @ref blob_pg_design
|
||||
* @ref blob_pg_examples
|
||||
* @ref blob_pg_config
|
||||
* @ref blob_pg_component
|
||||
|
||||
The blobstore is designed primarily to run on "next generation" media, which
|
||||
means the device supports fast random reads _and_ writes, with no required
|
||||
background garbage collection. However, in practice the design will run well on
|
||||
NAND too. Absolutely no attempt will be made to make this efficient on spinning
|
||||
media.
|
||||
## Target Audience {#blob_pg_audience}
|
||||
|
||||
## Design Goals
|
||||
The programmer's guide is intended for developers authoring applications that utilize the SPDK Blobstore. It is
|
||||
intended to supplement the source code in providing an overall understanding of how to integrate Blobstore into
|
||||
an application as well as provide some high level insight into how Blobstore works behind the scenes. It is not
|
||||
intended to serve as a design document or an API reference and in some cases source code snippets and high level
|
||||
sequences will be discussed; for the latest source code reference refer to the [repo](https://github.com/spdk).
|
||||
|
||||
The blobstore is intended to solve a number of problems that local databases
|
||||
have when using traditional POSIX filesystems. These databases are assumed to
|
||||
'own' the entire storage device, to not need to track access times, and to
|
||||
require only a very simple directory hierarchy. These assumptions allow
|
||||
significant design optimizations over a traditional POSIX filesystem and block
|
||||
stack.
|
||||
## Introduction {#blob_pg_intro}
|
||||
|
||||
Asynchronous I/O can be an order of magnitude or more faster than synchronous
|
||||
I/O, and so solutions like
|
||||
[libaio](https://git.fedorahosted.org/cgit/libaio.git/) have become popular.
|
||||
However, libaio is [not actually
|
||||
asynchronous](http://www.scylladb.com/2016/02/09/qualifying-filesystems/) in
|
||||
all cases. The blobstore will provide truly asynchronous operations in all
|
||||
cases without any hidden locks or stalls.
|
||||
Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system
|
||||
backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services
|
||||
can be local databases or key/value stores (MySQL, RocksDB), they can be dedicated appliances (SAN, NAS), or
|
||||
distributed storage systems (ex. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however,
|
||||
and it is intentionally not POSIX compliant. To avoid confusion, we avoid references to files or objects, instead
|
||||
using the term 'blob'. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to
|
||||
groups of blocks on a block device called 'blobs'. Blobs are typically large, measured in at least hundreds of
|
||||
kilobytes, and are always a multiple of the underlying block size.
|
||||
|
||||
With the advent of NVMe, storage devices now have a hardware interface that
|
||||
allows for highly parallel I/O submission from many threads with no locks.
|
||||
Unfortunately, placement of data on a device requires some central coordination
|
||||
to avoid conflicts. The blobstore will separate operations that require
|
||||
coordination from operations that do not, and allow users to explicitly
|
||||
associate I/O with channels. Operations on different channels happen in
|
||||
parallel, all the way down to the hardware, with no locks or coordination.
|
||||
The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random
|
||||
reads and writes, with no required background garbage collection. However, in practice the design will run well on
|
||||
NAND too.
|
||||
|
||||
As media access latency improves, strategies for in-memory caching are changing
|
||||
and often the kernel page cache is a bottleneck. Many databases have moved to
|
||||
opening files only in O_DIRECT mode, avoiding the page cache entirely, and
|
||||
writing their own caching layer. With the introduction of next generation media
|
||||
and its additional expected latency reductions, this strategy will become far
|
||||
more prevalent. To support this, the blobstore will perform no in-memory
|
||||
caching of data at all, essentially making all blob operations conceptually
|
||||
equivalent to O_DIRECT. This means the blobstore has similar restrictions to
|
||||
O_DIRECT where data can only be read or written in units of pages (4KiB),
|
||||
although memory alignment requirements are much less strict than O_DIRECT (the
|
||||
pages can even be composed of scattered buffers). We fully expect that DRAM
|
||||
caching will remain critical to performance, but leave the specifics of the
|
||||
cache design to higher layers.
|
||||
## Theory of Operation {#blob_pg_theory}
|
||||
|
||||
Storage devices pull data from host memory using a DMA engine, and those DMA
|
||||
engines operate on physical addresses and often introduce alignment
|
||||
restrictions. Further, to avoid data corruption, the data must not be paged out
|
||||
by the operating system while it is being transferred to disk. Traditionally,
|
||||
operating systems solve this problem either by copying user data into special
|
||||
kernel buffers that were allocated for this purpose and the I/O operations are
|
||||
performed to/from there, or taking locks to mark all user pages as locked and
|
||||
unmovable. Historically, the time to perform the copy or locking was
|
||||
inconsequential relative to the I/O time at the storage device, but that is
|
||||
simply no longer the case. The blobstore will instead provide zero copy,
|
||||
lockless read and write access to the device. To do this, memory to be used for
|
||||
blob data must be registered with the blobstore up front, preferably at
|
||||
application start and out of the I/O path, so that it can be pinned, the
|
||||
physical addresses can be determined, and the alignment requirements can be
|
||||
verified.
|
||||
### Abstractions:
|
||||
|
||||
Hardware devices are necessarily limited to some maximum queue depth. For NVMe
|
||||
devices that can be quite large (the spec allows up to 64k!), but is typically
|
||||
much smaller (128 - 1024 per queue). Under heavy load, databases may generate
|
||||
enough requests to exceed the hardware queue depth, which requires queueing in
|
||||
software. For operating systems this is often done in the generic block layer
|
||||
and may cause unexpected stalls or require locks. The blobstore will avoid this
|
||||
by simply failing requests with an appropriate error code when the queue is
|
||||
full. This allows the blobstore to easily stick to its commitment to never
|
||||
block, but may require the user to provide their own queueing layer.
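The queueing layer mentioned above could look roughly like the following. This is a minimal sketch, not Blobstore code: the device is a mock with a fixed queue depth, `-EAGAIN` stands in for whatever error code a full queue actually produces, and every `example_` name is hypothetical.

```c
#include <errno.h>

/* Mock device with a fixed hardware queue depth (illustrative only). */
struct example_device {
    int depth;       /* maximum number of outstanding requests */
    int outstanding; /* requests currently in flight */
};

/* Application-side backlog of requests that could not be submitted yet. */
struct example_backlog {
    int queued;
};

/* Never blocks: fails immediately when the queue is full.
 * -EAGAIN is an assumed error code for this sketch. */
static int
example_dev_submit(struct example_device *dev)
{
    if (dev->outstanding >= dev->depth) {
        return -EAGAIN;
    }
    dev->outstanding++;
    return 0;
}

/* Try to submit; park the request in the backlog if the queue is full. */
static void
example_app_submit(struct example_device *dev, struct example_backlog *bl)
{
    if (example_dev_submit(dev) == -EAGAIN) {
        bl->queued++;
    }
}

/* On completion, free a queue slot and resubmit one parked request. */
static void
example_app_complete(struct example_device *dev, struct example_backlog *bl)
{
    dev->outstanding--;
    if (bl->queued > 0 && example_dev_submit(dev) == 0) {
        bl->queued--;
    }
}
```

The design point is that the retry policy lives entirely in the application, so the storage layer itself never has to wait.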
|
||||
The Blobstore defines a hierarchy of storage abstractions as follows.
|
||||
|
||||
## The Basics
|
||||
* **Logical Block**: Logical blocks are exposed by the disk itself, which are numbered from 0 to N, where N is the
|
||||
number of blocks in the disk. A logical block is typically either 512B or 4KiB.
|
||||
* **Page**: A page is defined to be a fixed number of logical blocks defined at Blobstore creation time. The logical
|
||||
blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk such
|
||||
that the first page worth of blocks is page 0, the second page is page 1, etc. A page is typically 4KiB in size,
|
||||
so this is either 8 or 1 logical blocks in practice. The SSD must be able to perform atomic reads and writes of
|
||||
at least the page size.
|
||||
* **Cluster**: A cluster is a fixed number of pages defined at Blobstore creation time. The pages that compose a cluster
|
||||
are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster
|
||||
worth of pages, cluster 1 is the second grouping of pages, etc. A cluster is typically 1MiB in size, or 256 pages.
|
||||
* **Blob**: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application
|
||||
and persist across power failures and reboots. Applications use a Blobstore provided identifier to access a particular blob.
|
||||
Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also
|
||||
store metadata in the form of key/value pairs with each blob which we'll refer to as xattrs (extended attributes).
|
||||
* **Blobstore**: An SSD which has been initialized by a Blobstore-based application is referred to as "a Blobstore." A
|
||||
Blobstore owns the entire underlying device which is made up of a private Blobstore metadata region and the collection of
|
||||
blobs as managed by the application.
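The arithmetic behind this hierarchy can be sketched in a few lines of C. The structure and function names below are hypothetical and are not part of the Blobstore API; the sizes used in the usage note are the typical values quoted above.

```c
#include <stdint.h>

/* Hypothetical geometry descriptor (illustrative, not a Blobstore type). */
struct example_geometry {
    uint32_t block_size;   /* logical block size in bytes (512 or 4096) */
    uint32_t page_size;    /* page size in bytes (typically 4096) */
    uint32_t cluster_size; /* cluster size in bytes (typically 1 MiB) */
};

/* Number of contiguous logical blocks that make up one page. */
static uint32_t
example_blocks_per_page(const struct example_geometry *g)
{
    return g->page_size / g->block_size;
}

/* Number of contiguous pages that make up one cluster. */
static uint32_t
example_pages_per_cluster(const struct example_geometry *g)
{
    return g->cluster_size / g->page_size;
}

/* First logical block of a page, counting pages from the start of the disk. */
static uint64_t
example_page_to_lba(const struct example_geometry *g, uint64_t page)
{
    return page * example_blocks_per_page(g);
}

/* First page of a cluster, counting clusters from the start of the disk. */
static uint64_t
example_cluster_to_page(const struct example_geometry *g, uint64_t cluster)
{
    return cluster * example_pages_per_cluster(g);
}
```

With 512B blocks, 4KiB pages and 1MiB clusters this gives 8 blocks per page and 256 pages per cluster, matching the figures in the list.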
|
||||
|
||||
The blobstore defines a hierarchy of three units of disk space. The smallest are
|
||||
the *logical blocks* exposed by the disk itself, which are numbered from 0 to N,
|
||||
where N is the number of blocks in the disk. A logical block is typically
|
||||
either 512B or 4KiB.
|
||||
### Atomicity
|
||||
|
||||
The blobstore defines a *page* to be a fixed number of logical blocks defined
|
||||
at blobstore creation time. The logical blocks that compose a page are
|
||||
contiguous. Pages are also numbered from the beginning of the disk such that
|
||||
the first page worth of blocks is page 0, the second page is page 1, etc. A
|
||||
page is typically 4KiB in size, so this is either 8 or 1 logical blocks in
|
||||
practice. The device must be able to perform atomic reads and writes of at
|
||||
least the page size.
|
||||
For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic
|
||||
operations of at least one page size. Atomicity here can refer to multiple operations:
|
||||
|
||||
The largest unit is a *cluster*, which is a fixed number of pages defined at
|
||||
blobstore creation time. The pages that compose a cluster are contiguous.
|
||||
Clusters are also numbered from the beginning of the disk, where cluster 0 is
|
||||
the first cluster worth of pages, cluster 1 is the second grouping of pages,
|
||||
etc. A cluster is typically 1MiB in size, or 256 pages.
|
||||
* **Data Writes**: For the case of data writes, the unit of atomicity is one page. Therefore if a write operation of
|
||||
greater than one page is underway and the system suffers a power failure, the data on media will be consistent at a page
|
||||
size granularity (if a single page were in the middle of being updated when power was lost, the data at that page location
|
||||
will be as it was prior to the start of the write operation following power restoration.)
|
||||
* **Blob Metadata Updates**: Each blob has its own set of metadata (xattrs, size, etc). For performance reasons, a copy of
|
||||
this metadata is kept in RAM and only synchronized with the on-disk version when the application makes an explicit call to
|
||||
do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not consistent until the call to
synchronize it (covered later), which is, however, performed atomically.
|
||||
* **Blobstore Metadata Updates**: Blobstore itself has its own metadata which, like per blob metadata, has a copy in both
|
||||
RAM and on-disk. Unlike the per blob metadata, however, the Blobstore metadata region is not made consistent via a blob
|
||||
synchronization call, it is only synchronized when the Blobstore is properly unloaded via API. Therefore, if the Blobstore
|
||||
metadata is updated (blob creation, deletion, resize, etc.) and not unloaded properly, it will need to perform some extra
|
||||
steps the next time it is loaded which will take a bit more time than it would have if shutdown cleanly, but there will be
|
||||
no inconsistencies.
|
||||
|
||||
On top of these three basic units, the blobstore defines three primitives. The
|
||||
most fundamental is the blob, where a blob is an ordered list of clusters plus
|
||||
an identifier. Blobs persist across power failures and reboots. The set of all
|
||||
blobs described by shared metadata is called the blobstore. I/O operations on
|
||||
blobs are submitted through a channel. Channels are tied to threads, but
|
||||
multiple threads can simultaneously submit I/O operations to the same blob on
|
||||
their own channels.
|
||||
### Callbacks
|
||||
|
||||
Blobs are read and written in units of pages by specifying an offset in the
|
||||
virtual blob address space. This offset is translated by first determining
|
||||
which cluster(s) are being accessed, and then translating to a set of logical
|
||||
blocks. This translation is done trivially using only basic math - there is no
|
||||
mapping data structure. Unlike read and write, blobs are resized in units of
|
||||
clusters.
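A minimal sketch of that translation is shown below, assuming the blob's ordered cluster list is available as a plain array. The `example_` names and the structure layout are hypothetical; the real implementation may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of a blob: an ordered list of cluster numbers. */
struct example_blob {
    const uint64_t *clusters;   /* clusters[i] backs the i-th cluster of the blob */
    size_t num_clusters;
    uint32_t pages_per_cluster; /* e.g. 256 for 1MiB clusters and 4KiB pages */
    uint32_t blocks_per_page;   /* e.g. 8 for 4KiB pages and 512B blocks */
};

/* Translate a page offset within the blob to the first logical block on disk.
 * Returns 0 on success, or -1 if the offset lies beyond the end of the blob. */
static int
example_blob_page_to_lba(const struct example_blob *blob, uint64_t page_offset,
                         uint64_t *lba)
{
    uint64_t idx = page_offset / blob->pages_per_cluster;
    uint64_t page_in_cluster = page_offset % blob->pages_per_cluster;
    uint64_t disk_page;

    if (idx >= blob->num_clusters) {
        return -1;
    }

    /* Clusters and pages are both numbered from the start of the disk. */
    disk_page = blob->clusters[idx] * blob->pages_per_cluster + page_in_cluster;
    *lba = disk_page * blob->blocks_per_page;
    return 0;
}
```

Only division, modulo and one array index are needed, which is what keeps the translation cheap.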
|
||||
Blobstore is callback driven; in the event that any Blobstore API is unable to make forward progress it will
not block but instead return control at that point and make a call to the callback function provided in the API, along with
arguments, when the original call is completed. The callback will be made on the same thread that the call was made from; more on
threads later. Some APIs, however, offer no callback arguments; in these cases the calls are fully synchronous. Examples of
asynchronous calls that utilize callbacks include those that involve disk IO, where some amount of polling
is required before the IO is completed.
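The callback pattern can be illustrated with a small mock. This is a sketch only: the operation, the poller and every `example_` name are hypothetical stand-ins rather than Blobstore APIs, and the only thing taken from the text is the shape of the flow (submit, return without blocking, poll, then callback with an error number).

```c
#include <stddef.h>

/* Completion callback: caller-supplied context plus an error number. */
typedef void (*example_op_complete)(void *cb_arg, int bserrno);

/* A single pending operation; a real implementation would track many. */
struct example_pending_op {
    example_op_complete cb_fn;
    void *cb_arg;
    int polls_remaining; /* simulated IO latency, in poll iterations */
    int in_flight;
};

/* Start the operation and return immediately without blocking. */
static void
example_submit(struct example_pending_op *op, example_op_complete cb_fn,
               void *cb_arg, int polls_needed)
{
    op->cb_fn = cb_fn;
    op->cb_arg = cb_arg;
    op->polls_remaining = polls_needed;
    op->in_flight = 1;
}

/* Called repeatedly on the submitting thread. Returns 1 if the operation
 * completed during this call (the callback has run), 0 otherwise. */
static int
example_poll(struct example_pending_op *op)
{
    if (!op->in_flight) {
        return 0;
    }
    if (op->polls_remaining > 0) {
        op->polls_remaining--;
        return 0;
    }
    op->in_flight = 0;
    op->cb_fn(op->cb_arg, 0); /* 0 means success */
    return 1;
}

/* Example callback: records success (1) or failure (-1) in the caller's flag. */
static void
example_mark_done(void *cb_arg, int bserrno)
{
    int *done = cb_arg;

    *done = (bserrno == 0) ? 1 : -1;
}
```

Because the callback runs from the poller on the submitting thread, the flag it sets needs no locking in this sketch.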
|
||||
|
||||
Blobs are described by their metadata which consists of a discontiguous set of
|
||||
pages stored in a reserved region on the disk. Each page of metadata is
|
||||
referred to as a *metadata page*. Blobs do not share metadata pages with other
|
||||
blobs, and in fact the design relies on the backing storage device supporting
|
||||
an atomic write unit greater than or equal to the page size. Most devices
|
||||
backed by NAND and next generation media support this atomic write capability,
|
||||
but often magnetic media does not.
|
||||
### Backend Support
|
||||
|
||||
The metadata region is fixed in size and defined upon creation of the
|
||||
blobstore. The size is configurable, but by default one page is allocated for
|
||||
each cluster. For 1MiB clusters and 4KiB pages, that results in 0.4% metadata
|
||||
overhead.
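The overhead figure follows directly from the ratio of one page to one cluster, using the sizes quoted above:

```latex
\frac{\text{page size}}{\text{cluster size}}
  = \frac{4\,\text{KiB}}{1\,\text{MiB}}
  = \frac{4096}{1048576}
  = \frac{1}{256}
  \approx 0.39\% \approx 0.4\%
```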
|
||||
Blobstore requires a backing storage device that can be integrated using the `bdev` layer, or by directly integrating a
|
||||
device driver to Blobstore. The blobstore performs operations on a backing block device by calling function pointers
|
||||
supplied to it at initialization time. For convenience, an implementation of these function pointers that route I/O
|
||||
to the bdev layer is available in `bdev_blob.c`. Alternatively, for example, the SPDK NVMe driver may be directly integrated
|
||||
bypassing a small amount of `bdev` layer overhead. These options will be discussed further in the upcoming section on examples.
|
||||
|
||||
## Conventions
|
||||
### Metadata Operations
|
||||
|
||||
Data formats on the device are specified in [Backus-Naur
|
||||
Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form). All data is
|
||||
stored on media in little-endian format. Unspecified data must be zeroed.
|
||||
Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single
thread to avoid taking locks on in-memory data structures that maintain data on the layout and definitions of blobs (along
with other data). In Blobstore this is implemented as the `metadata thread`, defined to be the thread on which the
application makes metadata related calls. It is up to the application to set up a separate thread to make these calls on
and to ensure that it does not mix relevant IO operations with metadata operations even if they are on separate threads.
This will be discussed further in the Design Considerations section.
|
||||
|
||||
## Media Format
|
||||
### Threads
|
||||
|
||||
The blobstore owns the entire storage device. The device is divided into
|
||||
clusters starting from the beginning, such that cluster 0 begins at the first
|
||||
logical block.
|
||||
An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios.
|
||||
The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a
|
||||
single core. In this case, the single thread would be used to submit both metadata operations as well as IO operations and
|
||||
it would be up to the application to ensure that only one metadata operation is issued at a time and not intermingled with
|
||||
affected IO operations.
|
||||
|
||||
### Channels
|
||||
|
||||
Channels are an SPDK-wide abstraction and with Blobstore the best way to think about them is that they are
|
||||
required in order to do IO. The application will perform IO to the channel and channels are best thought of as being
|
||||
associated 1:1 with a thread.
|
||||
|
||||
### Blob Identifiers
|
||||
|
||||
When an application creates a blob, it does not provide a name as is the case with many other similar
storage systems. Instead, it is returned a unique identifier by the Blobstore that it needs to use on subsequent APIs to
perform operations on the blob.
|
||||
|
||||
## Design Considerations {#blob_pg_design}
|
||||
|
||||
### Initialization Options
|
||||
|
||||
When the Blobstore is initialized, there are multiple configuration options to consider. The
|
||||
options and their defaults are:
|
||||
|
||||
* **Cluster Size**: By default, this value is 1MB. The cluster size is required to be a multiple of page size and should be
|
||||
selected based on the application’s usage model in terms of allocation. Recall that blobs are made up of clusters so when
|
||||
a blob is allocated/deallocated or changes in size, disk LBAs will be manipulated in groups of cluster size. If the
|
||||
application is expecting to deal with mainly very large (always multiple GB) blobs then it may make sense to change the
|
||||
cluster size to 1GB for example.
|
||||
* **Number of Metadata Pages**: By default, Blobstore allocates one metadata page per cluster, which is the worst case
scenario in terms of metadata usage. This can be overridden here; however, the space savings are
not significant.
|
||||
* **Maximum Simultaneous Metadata Operations**: Determines how many internally pre-allocated memory structures are set
|
||||
aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
|
||||
* **Maximum Simultaneous Operations Per Channel**: Determines how many internally pre-allocated memory structures are set
|
||||
aside for channel operations. Changes to this value would be application dependent and best determined by knowledge
of the typical usage model, an understanding of the types of SSDs being used, and empirical data. The default is 512.
|
||||
* **Blobstore Type**: This field is a character array to be used by applications that need to identify whether the
|
||||
Blobstore found here is appropriate to claim or not. The default is NULL and unless the application is being deployed in
|
||||
an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore, there
|
||||
is no need to set this value. It can, however, be set to any valid set of characters.
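The options and defaults listed above can be pictured as a single structure filled in by an init helper. The sketch below is a stand-in written for this guide: the structure, its field names and the helper functions are hypothetical, and the use of 0 for the metadata page count to mean "use the default" is an assumption made only for this example.

```c
#include <stdint.h>
#include <string.h>

#define EXAMPLE_BSTYPE_LEN 16

/* Hypothetical mirror of the initialization options (not the real API). */
struct example_bs_opts {
    uint32_t cluster_sz;      /* cluster size in bytes */
    uint32_t num_md_pages;    /* 0 = let the default apply (assumption) */
    uint32_t max_md_ops;      /* simultaneous metadata operations */
    uint32_t max_channel_ops; /* simultaneous operations per channel */
    char bstype[EXAMPLE_BSTYPE_LEN]; /* all zeroes = no type set */
};

/* Fill in the defaults described in the list above. */
static void
example_bs_opts_init(struct example_bs_opts *opts)
{
    memset(opts, 0, sizeof(*opts));
    opts->cluster_sz = 1024 * 1024;
    opts->max_md_ops = 32;
    opts->max_channel_ops = 512;
}

/* The cluster size is required to be a multiple of the page size. */
static int
example_bs_opts_valid(const struct example_bs_opts *opts, uint32_t page_size)
{
    return opts->cluster_sz != 0 && opts->cluster_sz % page_size == 0;
}
```

An application working mainly with multi-gigabyte blobs would start from the defaults and then raise `cluster_sz`, for example to 1GiB, before initializing the Blobstore.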
|
||||
|
||||
### Sub-page Sized Operations
|
||||
|
||||
Blobstore is only capable of doing page sized read/write operations. If the application
|
||||
requires finer granularity it will have to accommodate that itself.
|
||||
|
||||
### Threads
|
||||
|
||||
As mentioned earlier, Blobstore can share a single thread with an application or the application
|
||||
can define any number of threads, within resource constraints, that makes sense. The basic considerations that must be
|
||||
followed are:
|
||||
* Metadata operations (API with MD in the name) should be isolated from each other as there is no internal locking on the
|
||||
memory structures affected by these API.
|
||||
* Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO would be one that is
|
||||
reading/writing to an area of a blob that a metadata operation is deallocating).
|
||||
* Asynchronous callbacks will always take place on the calling thread.
|
||||
* No assumptions about IO ordering can be made regardless of how many or which threads were involved in the issuing.
|
||||
|
||||
### Data Buffer Memory
|
||||
|
||||
As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated
|
||||
with SPDK API.
|
||||
|
||||
### Error Handling
|
||||
|
||||
Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values
|
||||
indicate an error. Synchronous calls will typically return an error value if applicable.
|
||||
|
||||
### Asynchronous API
|
||||
|
||||
Asynchronous calls will not return control immediately, but at the point in execution where no
more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of
an asynchronous call until the callback has completed.
|
||||
|
||||
### Xattrs
|
||||
|
||||
Setting and removing xattrs in Blobstore is a metadata operation; xattrs are stored in per blob metadata.
Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a separate step for
persisting per blob metadata allows applications to perform batches of xattr updates, for example, with only one
more expensive call to synchronize and persist the values.
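The batching benefit can be shown with a small model that keeps two copies of the per blob metadata, one current and one as last persisted. Everything here is a hypothetical mock (names, sizes, and the append-only xattr list); it does not call or reproduce any Blobstore API.

```c
#include <stdio.h>

#define EXAMPLE_MAX_XATTRS 8

struct example_xattr {
    char name[16];
    char value[16];
};

struct example_md {
    struct example_xattr xattrs[EXAMPLE_MAX_XATTRS];
    int count;
};

/* Two copies of the mutable metadata: the current one in RAM and the one
 * last written to disk (modeled here as a second in-memory copy). */
struct example_blob_md {
    struct example_md active;
    struct example_md persisted;
    int sync_count; /* number of (expensive) synchronize calls made */
};

/* Cheap: touches only the in-memory copy. For brevity this sketch appends
 * and does not replace an existing xattr with the same name. */
static int
example_set_xattr(struct example_blob_md *b, const char *name, const char *value)
{
    struct example_xattr *x;

    if (b->active.count >= EXAMPLE_MAX_XATTRS) {
        return -1;
    }
    x = &b->active.xattrs[b->active.count++];
    snprintf(x->name, sizeof(x->name), "%s", name);
    snprintf(x->value, sizeof(x->value), "%s", value);
    return 0;
}

/* Expensive: persists everything changed since the last synchronize call. */
static void
example_sync_md(struct example_blob_md *b)
{
    b->persisted = b->active;
    b->sync_count++;
}
```

Any number of `example_set_xattr()` calls can be followed by a single `example_sync_md()`, and nothing is considered persisted until that call completes.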
|
||||
|
||||
### Synchronizing Metadata
|
||||
|
||||
As described earlier, there are two types of metadata in Blobstore, per blob and one global
|
||||
metadata for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global
|
||||
metadata will be inconsistent during run-time and only synchronized on proper shutdown. The implication, however, of
|
||||
an improper shutdown is only a performance penalty on the next startup as the global metadata will need to be rebuilt
|
||||
based on a parsing of the per blob metadata. For consistent start times, it is important to always close down the Blobstore
|
||||
properly via API.
|
||||
|
||||
### Iterating Blobs
|
||||
|
||||
Multiple examples of how to iterate through the blobs are included in the sample code and tools.
|
||||
It is worth noting, however, that when walking through the existing blobs via the iter API, if your application finds the blob it is
looking for, it will either need to explicitly close it (because it was opened internally by the Blobstore) or complete walking
the full list.
|
||||
|
||||
### The Super Blob
|
||||
|
||||
The super blob is simply a single blob ID that can be stored as part of the global metadata to act
|
||||
as sort of a "root" blob. The application may choose to use this blob to store any information that it needs or finds
|
||||
relevant in understanding any kind of structure for what is on the Blobstore.
|
||||
|
||||
## Examples {#blob_pg_examples}
|
||||
|
||||
There are multiple examples of Blobstore usage in the [repo](https://github.com/spdk/spdk):
|
||||
|
||||
* **Hello World**: Actually named `hello_blob.c` this is a very basic example of a single threaded application that
|
||||
does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses
|
||||
a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end
|
||||
is a `bdev` module; thus this example uses not only the SPDK Framework but the `bdev` layer as well.
|
||||
|
||||
* **Hello NVME Blob**: `hello_nvme_blob.c` is the non-bdev version of `hello_blob.c` and simply shows how an
|
||||
application can directly integrate Blobstore with the SPDK NVMe driver without using the `bdev` layer at all.
|
||||
|
||||
* **CLI**: The `blobcli.c` example is a command line utility intended to serve not only as example code but also as a test
|
||||
and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the
|
||||
SPDK Framework and the `bdev` layer but offers multiple modes of operation to accomplish some real-world tasks. In
|
||||
command mode, it accepts single-shot commands which can be a little time consuming if there are many commands to
|
||||
get through as each one will take a few seconds waiting for DPDK initialization. It therefore has a shell mode that
|
||||
allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore with simple commands
|
||||
that include the ability to import/export blobs from/to regular files. Lastly there is a scripting mode to automate
|
||||
a series of tasks, again, handy for development and/or test type activities.
|
||||
|
||||
## Configuration {#blob_pg_config}
|
||||
|
||||
Blobstore configuration options are described in the initialization options section under @ref blob_pg_design.
|
||||
|
||||
## Component Detail {#blob_pg_component}
|
||||
|
||||
The information in this section is not necessarily relevant to designing an application for use with Blobstore, but
|
||||
understanding a little more about the internals may be interesting and is also included here for those wanting to
|
||||
contribute to the Blobstore effort itself.
|
||||
|
||||
### Media Format
|
||||
|
||||
The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such
|
||||
that cluster 0 begins at the first logical block.
|
||||
|
||||
LBA 0 LBA N
|
||||
+-----------+-----------+-----+-----------+
|
||||
| Cluster 0 | Cluster 1 | ... | Cluster N |
|
||||
+-----------+-----------+-----+-----------+
|
||||
|
||||
Or in formal notation:
|
||||
|
||||
<media-format> ::= <cluster0> <cluster>*
|
||||
|
||||
|
||||
Cluster 0 is special and has the following format, where page 0
|
||||
is the first page of the cluster:
|
||||
Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:
|
||||
|
||||
+--------+-------------------+
|
||||
| Page 0 | Page 1 ... Page N |
|
||||
@@ -167,109 +260,71 @@ is the first page of the cluster:
|
||||
| Block | |
|
||||
+--------+-------------------+
|
||||
|
||||
Or formally:
|
||||
The super block is a single page located at the beginning of the partition. It contains basic information about
|
||||
the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer
|
||||
to the latest source code for complete structural details of the super block and metadata region.
|
||||
|
||||
<cluster0> ::= <super-block> <metadata-region>
|
||||
Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages
|
||||
form a linked list. The first page in the list will be written in place on update, while all other pages will
|
||||
be written to fresh locations. This requires the backing device to support an atomic write size greater than
|
||||
or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.
|
||||
|
||||
The super block is a single page located at the beginning of the partition.
|
||||
It contains basic information about the blobstore. The metadata region
|
||||
is the remainder of cluster 0 and may extend to additional clusters.
|
||||
### Sequences and Batches
|
||||
|
||||
<super-block> ::= <sb-version> <sb-len> <sb-super-blob> <sb-params>
|
||||
<sb-metadata-start> <sb-metadata-len>
|
||||
<sb-blobid-start> <sb-blobid-len> <crc>
|
||||
<sb-version> ::= u32
|
||||
<sb-len> ::= u32 # Length of this super block, in bytes. Starts from the
|
||||
# beginning of this structure.
|
||||
<sb-super-blob> ::= u64 # Special blobid set by the user that indicates where
|
||||
# their starting metadata resides.
|
||||
Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either
|
||||
a serial fashion or in parallel, respectively. Both are defined using the following structure:
|
||||
|
||||
<sb-metadata-start> ::= u64 # Metadata start location, in pages
<sb-metadata-len> ::= u64 # Metadata length, in pages
|
||||
<sb-blobid-start> ::= u32 # Start of bitmask of valid blobids (in pages)
|
||||
<sb-blobid-len> ::= u32 # Length of bitmask of valid blobids (in pages)
|
||||
<crc> ::= u32 # Crc for super block
|
||||
~~~{.sh}
|
||||
struct spdk_bs_request_set;
|
||||
~~~
|
||||
|
||||
The `<sb-params>` data contains parameters specified by the user when the blob
|
||||
store was initially formatted.
|
||||
These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups
|
||||
of IO. They are an internal construct only and are pre-allocated on a per channel basis (channels were discussed
|
||||
earlier). They are removed from a channel associated linked list when the set (sequence or batch) is started and
|
||||
then returned to the list when completed.
|
||||
|
||||
<sb-params> ::= <sb-page-size> <sb-cluster-size> <sb-bs-type>
|
||||
<sb-page-size> ::= u32 # page size, in bytes.
|
||||
# Must be a multiple of the logical block size.
|
||||
# The implementation today requires this to be 4KiB.
|
||||
<sb-cluster-size> ::= u32 # Cluster size, in bytes.
|
||||
# Must be a multiple of the page size.
|
||||
<sb-bs-type> ::= char[16] # Blobstore type
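For readers who find C easier to scan than grammar notation, the super block productions can be transcribed field by field as below. This is an illustrative reading of the notation only: the structure and field names are hypothetical, it is not the structure used in the Blobstore source, and the real layout should be taken from the code. Values are little-endian on media, as stated under Conventions.

```c
#include <stddef.h>
#include <stdint.h>

/* <sb-params>: parameters chosen when the Blobstore was formatted. */
struct example_sb_params {
    uint32_t page_size;    /* <sb-page-size>, in bytes */
    uint32_t cluster_size; /* <sb-cluster-size>, in bytes */
    char bstype[16];       /* <sb-bs-type> */
};

/* <super-block>, with fields in the order the grammar lists them. */
struct example_super_block {
    uint32_t version;                /* <sb-version> */
    uint32_t length;                 /* <sb-len>, in bytes */
    uint64_t super_blob;             /* <sb-super-blob> */
    struct example_sb_params params; /* <sb-params> */
    uint64_t md_start;               /* metadata start, in pages */
    uint64_t md_len;                 /* metadata length, in pages */
    uint32_t blobid_start;           /* <sb-blobid-start>, in pages */
    uint32_t blobid_len;             /* <sb-blobid-len>, in pages */
    uint32_t crc;                    /* <crc> */
};

/* Check the two "must be a multiple of" rules stated in the grammar comments.
 * Returns 1 if both hold, 0 otherwise. */
static int
example_sb_params_valid(const struct example_sb_params *p, uint32_t block_size)
{
    return p->page_size != 0 && block_size != 0 &&
           p->page_size % block_size == 0 &&
           p->cluster_size % p->page_size == 0;
}
```

Every field in this transcription falls on its natural alignment, so common ABIs place `params` at byte offset 16 and `crc` at byte offset 64 without inserting padding between fields.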
|
||||
### Key Internal Structures
|
||||
|
||||
Each blob is allocated a non-contiguous set of pages inside the metadata region
|
||||
for its metadata. These pages form a linked list. The first page in the list
|
||||
will be written in place on update, while all other pages will be written to
|
||||
fresh locations. This requires the backing device to support an atomic write
|
||||
size greater than or equal to the page size to guarantee that the operation is
|
||||
atomic. See the section on atomicity for details.
|
||||
`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones
|
||||
are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines
|
||||
the public API is `blob.h`.
|
||||
|
||||
Each page is defined as:
|
||||
~~~{.sh}
|
||||
struct spdk_blob
|
||||
~~~
|
||||
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two
|
||||
copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written
|
||||
to disk.
|
||||
|
||||
<metadata-page> ::= <blob-id> <blob-sequence-num> <blob-descriptor>*
|
||||
<blob-next> <blob-crc>
|
||||
<blob-id> ::= u64 # The blob guid
|
||||
<blob-sequence-num> ::= u32 # The sequence number of this page in the linked
|
||||
# list.
|
||||
~~~{.sh}
|
||||
struct spdk_blob_mut_data
|
||||
~~~
|
||||
This is a per-blob structure, included in the `struct spdk_blob` struct that actually defines the blob itself. It has the
|
||||
specific information on size and makeup of the blob (i.e., how many clusters are allocated for this blob and which ones).
|
||||
|
||||
<blob-descriptor> ::= <blob-descriptor-type> <blob-descriptor-length>
|
||||
<blob-descriptor-data>
|
||||
<blob-descriptor-type> ::= u8 # 0 means padding, 1 means "extent", 2 means
|
||||
# xattr, 3 means flags. The type
|
||||
# describes how to interpret the descriptor data.
|
||||
<blob-descriptor-length> ::= u32 # Length of the entire descriptor
|
||||
~~~{.sh}
|
||||
struct spdk_blob_store
|
||||
~~~
|
||||
This is the main in-memory structure for the entire Blobstore. It defines the global on disk metadata region and maintains
|
||||
information relevant to the entire system - initialization options such as cluster size, etc.
|
||||
|
||||
<blob-descriptor-data-padding> ::= u8
|
||||
~~~{.sh}
|
||||
struct spdk_bs_super_block
|
||||
~~~
|
||||
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore
|
||||
structure just discussed along with other elements one would expect to see here such as signature, version, checksum, etc.
|
||||
|
||||
<blob-descriptor-data-extent> ::= <extent-cluster-id> <extent-cluster-count>
|
||||
<extent-cluster-id> ::= u32 # The cluster id where this extent starts
|
||||
<extent-cluster-count> ::= u32 # The number of clusters in this extent
|
||||
### Code Layout and Common Conventions
|
||||
|
||||
<blob-descriptor-data-xattr> ::= <xattr-name-length> <xattr-value-length>
|
||||
<xattr-name> <xattr-value>
|
||||
<xattr-name-length> ::= u16
|
||||
<xattr-value-length> ::= u16
|
||||
<xattr-name> ::= u8*
|
||||
<xattr-value> ::= u8*
|
||||
In general, `blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For
|
||||
example,
|
||||
|
||||
<blob-descriptor-data-flags> ::= <flags-invalid> <flags-data-ro> <flags-md-ro>
|
||||
~~~{.sh}
|
||||
/* START spdk_bs_md_delete_blob */
|
||||
< relevant functions to accomplish the deletion of a blob >
|
||||
/* END spdk_bs_md_delete_blob */
|
||||
~~~
|
||||
|
||||
<flags-invalid> ::= u64
|
||||
<flags-data-ro> ::= u64
|
||||
<flags-md-ro> ::= u64
|
||||
|
||||
<blob-next> ::= u32 # The offset into the metadata region that contains the
|
||||
# next page of metadata. 0 means no next page.
|
||||
<blob-crc> ::= u32 # CRC of the entire page
|
||||
|
||||
|
||||
Descriptors cannot span metadata pages.
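The descriptor grammar above suggests a simple walk over the descriptor area of one metadata page. The sketch below makes several assumptions that the grammar does not settle: the length field is read in host byte order, it counts the whole descriptor including the 5-byte type/length header (as the "Length of the entire descriptor" comment suggests), and a padding descriptor marks the end of the used area. `count_descriptors` is a hypothetical helper, not a Blobstore function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Descriptor types as listed in the grammar. */
enum {
	DESC_PADDING = 0,
	DESC_EXTENT  = 1,
	DESC_XATTR   = 2,
	DESC_FLAGS   = 3
};

#define DESC_HDR_LEN 5u	/* u8 type + u32 length */

/*
 * Count the non-padding descriptors in buf[0..len).
 * Returns -1 if a descriptor is shorter than its header or runs past the
 * buffer, which also enforces that descriptors do not span pages.
 */
static int count_descriptors(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (off + DESC_HDR_LEN <= len) {
		uint8_t type = buf[off];
		uint32_t desc_len;

		if (type == DESC_PADDING) {
			break;	/* assumed: padding means no more descriptors */
		}
		/* Assumed host byte order; memcpy avoids unaligned access. */
		memcpy(&desc_len, buf + off + 1, sizeof(desc_len));
		if (desc_len < DESC_HDR_LEN || desc_len > len - off) {
			return -1;
		}
		count++;
		off += desc_len;
	}
	return count;
}
```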
|
||||
|
||||
## Atomicity
|
||||
|
||||
Metadata in the blobstore is cached and must be explicitly synced by the user.
|
||||
Data is not cached, however, so when a write completes the data can be
|
||||
considered durable if the metadata is synchronized. Metadata does not often
|
||||
change, and in fact only must be synchronized after these explicit operations:
|
||||
|
||||
* resize
|
||||
* set xattr
|
||||
* remove xattr
|
||||
|
||||
Any other operation will not dirty the metadata. Further, the metadata for each
|
||||
blob is independent of all of the others, so a synchronization operation is
|
||||
only needed on the specific blob that is dirty.
|
||||
|
||||
The metadata consists of a linked list of pages. Updates to the metadata are
|
||||
done by first writing page 2 through N to a new location, writing page 1 in
|
||||
place to atomically update the chain, and then erasing the remainder of the old
|
||||
chain. The vast majority of the time, blobs consist of just a single metadata
|
||||
page and so this operation is very efficient. For this scheme to work the write
|
||||
to the first page must be atomic, which requires hardware support from the
|
||||
backing device. For most, if not all, NVMe SSDs, an atomic write unit of 4KiB
|
||||
can be expected. Devices specify their atomic write unit in their NVMe identify
|
||||
data - specifically in the AWUN field.
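As a rough illustration of the atomicity requirement, the check below converts an AWUN value into bytes and compares it with the 4KiB page size. It assumes that AWUN is reported as a 0's based count of logical blocks, as in the NVMe identify controller data; the function names and `BS_PAGE_SIZE` are hypothetical and are not part of SPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BS_PAGE_SIZE 4096u	/* metadata page size assumed from the text above */

/* Convert a raw AWUN value (0's based, in logical blocks) to bytes. */
static uint64_t atomic_write_bytes(uint16_t awun, uint32_t lba_size)
{
	return ((uint64_t)awun + 1) * lba_size;
}

/* True if an in-place write of the first metadata page fits in one atomic unit. */
static bool first_page_write_is_atomic(uint16_t awun, uint32_t lba_size)
{
	assert(lba_size > 0);
	return atomic_write_bytes(awun, lba_size) >= BS_PAGE_SIZE;
}
```

For example, a device with 512-byte logical blocks would need a raw AWUN of at least 7 (eight blocks) to cover a 4KiB page under these assumptions.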
|
||||
And for the most part the following conventions are followed throughout:
|
||||
* functions beginning with an underscore are called internally only
|
||||
* functions or variables with the letters `cpl` are related to set or callback completions
|
||||
|
@ -25,7 +25,11 @@
|
||||
- @ref blobfs
|
||||
- @ref jsonrpc
|
||||
|
||||
# Programmer Guides {#general}
|
||||
# Programmer Guides {#prog_guides}
|
||||
|
||||
- @ref blob
|
||||
|
||||
# General Information {#general}
|
||||
|
||||
- @ref bdev_pg
|
||||
- @ref bdev_module
|
||||
|
80
doc/template_pg.md
Normal file
@ -0,0 +1,80 @@
|
||||
# ComponentName Programmer's Guide {#componentname_pg}
|
||||
|
||||
# In this document {#componentname_pg_toc}
|
||||
|
||||
@ref componentname_pg_audience
|
||||
@ref componentname_pg_intro
|
||||
@ref componentname_pg_theory
|
||||
@ref componentname_pg_design
|
||||
@ref componentname_pg_examples
|
||||
@ref componentname_pg_config
|
||||
@ref componentname_pg_component
|
||||
@ref componentname_pg_sequences
|
||||
|
||||
## Target Audience {#componentname_pg_audience}
|
||||
|
||||
This programmer's guide is intended for developers authoring applications that utilize the SPDK <COMPONENT NAME>. It is
|
||||
intended to supplement the source code to provide an overall understanding of how to integrate <COMPONENT NAME> into
|
||||
an application as well as provide some high level insight into how <COMPONENT NAME> works behind the scenes. It is not
|
||||
intended to serve as a design document or an API reference but in some cases source code snippets and high level
|
||||
sequences will be discussed. For the latest source code reference refer to the [repo](https://github.com/spdk).
|
||||
|
||||
## Introduction {#componentname_pg_intro}
|
||||
|
||||
Provide some high level description of what this component is, what it does and maybe why it exists. This shouldn't be
|
||||
a lengthy tutorial or commentary on storage in general or the goodness of SPDK but provide enough information to
|
||||
set the stage for someone about to write an application to integrate with this component. They won't be totally
|
||||
starting from scratch if they're at this point; they are by definition a storage application developer if they are
|
||||
reading this guide.
|
||||
|
||||
## Theory of Operation {#componentname_pg_theory}
|
||||
|
||||
Create subsections here to drill down into the "how" this component works. This isn't a design section however so
|
||||
avoid getting into too many details, just hit the high level concepts that would leave the developer with a
|
||||
50K foot overview of the major elements/assumptions/concepts that they should have some baseline knowledge about before
|
||||
they start writing code.
|
||||
|
||||
Some questions to consider when authoring this section:
|
||||
|
||||
* What are the basic primitives that this component exposes?
|
||||
* How are these primitives related to one another?
|
||||
* What are the threading rules when using these primitives?
|
||||
* What are the theoretical performance implications for different scaling vectors?
|
||||
* Are there any other documents or specifications that the user should be familiar with?
|
||||
* What are the intended use cases?
|
||||
|
||||
## Design Considerations {#componentname_pg_design}
|
||||
|
||||
Here is where you want to highlight things they need to think about in *their* design. If you have written test code
|
||||
for this module think about the things that you needed to go learn about to properly interact with this module. Think
|
||||
about how they need to consider initialization options, threading, limitations, any sort of quirky or non-obvious
|
||||
interactions or module behaviors that might save them some time and effort by thinking about before they start their
|
||||
design.
|
||||
|
||||
## Examples {#componentname_pg_examples}
|
||||
|
||||
List all of the relevant examples we have in the repo that use this module and describe a little about what they do.
|
||||
|
||||
## Configuration {#componentname_pg_config}
|
||||
|
||||
This section should describe the mechanisms for configuring the component at a high level (i.e. you can configure it
|
||||
using a config file, or you can configure it using RPC calls over a unix domain socket). It should also talk about
|
||||
when you can configure it - i.e. at run time or only up front. For specifics about how the RPCs work or the config
|
||||
file format, link to the appropriate user guide instead of putting that information here.
|
||||
|
||||
## Component Detail {#componentname_pg_component}
|
||||
|
||||
This is where we can provide some design level detail if it makes sense for this module. We don't want to have
|
||||
design docs as part of SPDK, the overhead and maintenance is too much for open source. We do, however, want
|
||||
to provide some level of insight into the codebase to promote getting more people involved and understanding
|
||||
of what the design is all about. The PG is meant to help a developer write their own application but we
|
||||
can use this section, per module, to test out a way to build out some internal design info as well. I see
|
||||
this as including an overview of key structures, concepts, etc., of the module itself. So, interesting info
|
||||
not required to write an application using the module but maybe just enough to provide the next level of
|
||||
detail into what's behind the scenes to get someone more interested in becoming a community contributor.
|
||||
|
||||
## Sequences {#componentname_pg_sequences}
|
||||
|
||||
If sequence diagrams make sense for this module, use mscgen to create simple UML-style (they don't need to be 100%
|
||||
UML compliant) diagrams for the API that an application needs to interact with. Details internal to the component
|
||||
should not be included.
|