doc: Programmer's guide template and example for blobstore

Includes the proposed Blobstore Programmer's Guide, which also serves
as an example for other programming guides, as well as a template
that can be used to create future programming guides.

Also removes the previous blob.md

Change-Id: Iefbb58b8c3ab015bf8e0cd02ba2fbe6a86c1852c
Signed-off-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-on: https://review.gerrithub.io/384118
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Paul Luse 2017-10-27 10:16:03 -07:00 committed by Jim Harris
parent 315630a2fd
commit da58800f21
3 changed files with 364 additions and 225 deletions

# Blobstore Programmer's Guide {#blob}
## In this document {#blob_pg_toc}
* @ref blob_pg_audience
* @ref blob_pg_intro
* @ref blob_pg_theory
* @ref blob_pg_design
* @ref blob_pg_examples
* @ref blob_pg_config
* @ref blob_pg_component
## Target Audience {#blob_pg_audience}
The programmer's guide is intended for developers authoring applications that utilize the SPDK Blobstore. It is
intended to supplement the source code in providing an overall understanding of how to integrate Blobstore into
an application as well as provide some high level insight into how Blobstore works behind the scenes. It is not
intended to serve as a design document or an API reference, but in some cases source code snippets and high level
sequences will be discussed; for the latest source code reference, refer to the [repo](https://github.com/spdk).
## Introduction {#blob_pg_intro}
Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system
backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services
can be local databases or key/value stores (MySQL, RocksDB), they can be dedicated appliances (SAN, NAS), or
distributed storage systems (ex. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however,
and it is intentionally not POSIX compliant. To avoid confusion, we avoid references to files or objects, instead
using the term 'blob'. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to
groups of blocks on a block device called 'blobs'. Blobs are typically large, measured in at least hundreds of
kilobytes, and are always a multiple of the underlying block size.
The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random
reads and writes, with no required background garbage collection. However, in practice the design will run well on
NAND too.
## Theory of Operation {#blob_pg_theory}
### Abstractions
The Blobstore defines a hierarchy of storage abstractions as follows.
* **Logical Block**: Logical blocks are exposed by the disk itself and are numbered from 0 to N, where N is the
number of blocks in the disk. A logical block is typically either 512B or 4KiB.
* **Page**: A page is defined to be a fixed number of logical blocks defined at Blobstore creation time. The logical
blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk such
that the first page worth of blocks is page 0, the second page is page 1, etc. A page is typically 4KiB in size,
so in practice this is either 8 logical blocks or 1 logical block. The SSD must be able to perform atomic reads and
writes of at least the page size.
* **Cluster**: A cluster is a fixed number of pages defined at Blobstore creation time. The pages that compose a cluster
are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster
worth of pages, cluster 1 is the second grouping of pages, etc. A cluster is typically 1MiB in size, or 256 pages.
* **Blob**: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application
and persist across power failures and reboots. Applications use a Blobstore provided identifier to access a particular blob.
Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also
store metadata in the form of key/value pairs with each blob which we'll refer to as xattrs (extended attributes).
* **Blobstore**: An SSD which has been initialized by a Blobstore-based application is referred to as "a Blobstore." A
Blobstore owns the entire underlying device which is made up of a private Blobstore metadata region and the collection of
blobs as managed by the application.
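Because this hierarchy is fixed at creation time, translating a byte offset to its page or cluster is plain arithmetic with no mapping data structure. A minimal sketch (the 4KiB/1MiB values are assumptions for illustration, not requirements):

~~~{.c}
#include <stdint.h>

/* Assumed geometry for illustration only; the real values are fixed
 * when the Blobstore is created. */
#define PAGE_SIZE    4096ULL             /* 4KiB page                */
#define CLUSTER_SIZE (1024ULL * 1024ULL) /* 1MiB cluster = 256 pages */

/* Which page and which cluster a given byte offset falls in. */
static inline uint64_t page_of(uint64_t off)    { return off / PAGE_SIZE; }
static inline uint64_t cluster_of(uint64_t off) { return off / CLUSTER_SIZE; }
~~~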
### Atomicity
For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic
operations of at least one page in size. Atomicity here can refer to multiple operations:
* **Data Writes**: For the case of data writes, the unit of atomicity is one page. Therefore if a write operation of
greater than one page is underway and the system suffers a power failure, the data on media will be consistent at a page
size granularity (if a single page were in the middle of being updated when power was lost, the data at that page location
will be as it was prior to the start of the write operation following power restoration).
* **Blob Metadata Updates**: Each blob has its own set of metadata (xattrs, size, etc). For performance reasons, a copy of
this metadata is kept in RAM and only synchronized with the on-disk version when the application makes an explicit call to
do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not persistent until the call to
synchronize it (covered later), which is itself performed atomically.
* **Blobstore Metadata Updates**: Blobstore itself has its own metadata which, like per blob metadata, has a copy in both
RAM and on-disk. Unlike the per blob metadata, however, the Blobstore metadata region is not made consistent via a blob
synchronization call; it is only synchronized when the Blobstore is properly unloaded via API. Therefore, if the Blobstore
metadata is updated (blob creation, deletion, resize, etc.) and the Blobstore is not unloaded properly, it will need to
perform some extra steps the next time it is loaded, which will take a bit more time than if it had been shut down cleanly,
but there will be no inconsistencies.
### Callbacks
Blobstore is callback driven; if any Blobstore API call is unable to make forward progress it will not block. Instead,
it returns control at that point and invokes the callback function provided in the API, along with its arguments, when
the original call is completed. The callback will be made on the same thread that the call was made from (more on
threads later). Some APIs, however, offer no callback arguments; in these cases the calls are fully synchronous.
Examples of asynchronous calls that utilize callbacks include those that involve disk IO, for example, where some
amount of polling is required before the IO is completed.
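As a sketch of the pattern, using `spdk_bs_init()` from `blob.h` (the `init_complete` and `app_ctx` names are hypothetical):

~~~{.c}
#include "spdk/blob.h"

/* Completion callback: runs later, on the same thread that called
 * spdk_bs_init(), once the operation has actually finished. */
static void
init_complete(void *cb_arg, struct spdk_blob_store *bs, int bserrno)
{
        if (bserrno != 0) {
                return; /* handle the error */
        }
        /* 'bs' is ready; stash it in the application context (cb_arg). */
}

static void
start_blobstore(struct spdk_bs_dev *bs_dev, void *app_ctx)
{
        /* NULL opts selects the defaults; control returns immediately
         * and init_complete() fires when the IO is done. */
        spdk_bs_init(bs_dev, NULL, init_complete, app_ctx);
}
~~~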
### Backend Support
Blobstore requires a backing storage device that can be integrated either through the `bdev` layer or by directly
integrating a device driver with Blobstore. The Blobstore performs operations on a backing block device by calling
function pointers supplied to it at initialization time. For convenience, an implementation of these function pointers
that routes I/O to the `bdev` layer is available in `bdev_blob.c`. Alternatively, for example, the SPDK NVMe driver may
be directly integrated, bypassing a small amount of `bdev` layer overhead. These options will be discussed further in
the upcoming section on examples.
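A hedged sketch of the `bdev` route, mirroring what the examples do (function names from `spdk/bdev.h` and `spdk/blob_bdev.h`; newer SPDK releases may expose a slightly different creation call, and `init_complete` is the hypothetical callback from the earlier sketch):

~~~{.c}
#include "spdk/bdev.h"
#include "spdk/blob_bdev.h"
#include "spdk/blob.h"

/* Look up a bdev by name, wrap it in the spdk_bs_dev interface that
 * Blobstore consumes, and initialize a Blobstore on top of it. */
static void
init_on_bdev(const char *bdev_name, void *app_ctx)
{
        struct spdk_bdev *bdev = spdk_bdev_get_by_name(bdev_name);
        struct spdk_bs_dev *bs_dev;

        if (bdev == NULL) {
                return; /* no bdev registered under that name */
        }
        bs_dev = spdk_bdev_create_bs_dev(bdev, NULL, NULL);
        spdk_bs_init(bs_dev, NULL, init_complete, app_ctx);
}
~~~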
### Metadata Operations
Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single thread to avoid
taking locks on the in-memory data structures that maintain data such as the layout and definitions of blobs (along
with other data). In Blobstore this is implemented as `the metadata thread`, which is defined to be the thread on which
the application makes metadata related calls. It is up to the application to set up a separate thread to make these
calls on, and to assure that it does not mix relevant IO operations with metadata operations, even if they are issued
on separate threads. This will be discussed further in the Design Considerations section.
### Threads
An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios.
The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a
single core. In this case, the single thread would be used to submit both metadata and IO operations, and it would be up
to the application to assure that only one metadata operation is issued at a time and is not intermingled with affected
IO operations.
### Channels
Channels are an SPDK-wide abstraction, and with Blobstore the best way to think about them is that they are
required in order to do IO. The application performs IO to a channel, and channels are best thought of as being
associated 1:1 with a thread.
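A minimal sketch of the channel lifecycle (both calls are in `blob.h`; allocation and free must happen on the thread that owns the channel, and the wrapper names are hypothetical):

~~~{.c}
#include "spdk/blob.h"

/* Called on the thread that will perform the IO: one channel per thread. */
static struct spdk_io_channel *
get_io_channel(struct spdk_blob_store *bs)
{
        return spdk_bs_alloc_io_channel(bs);
}

/* Called later, on that same thread, once all IO on it is finished. */
static void
put_io_channel(struct spdk_io_channel *channel)
{
        spdk_bs_free_io_channel(channel);
}
~~~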
### Blob Identifiers
When an application creates a blob, it does not provide a name as is the case with many other similar
storage systems; instead, the Blobstore returns a unique identifier that the application needs to use in subsequent
API calls to perform operations on that blob.
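A sketch of creation (the call has been named `spdk_bs_create_blob()` in recent releases and `spdk_bs_md_create_blob()` in the era of this commit; `create_complete` and `make_blob` are hypothetical names):

~~~{.c}
#include "spdk/blob.h"

/* The blob ID delivered here is the only handle to the new blob; the
 * application must remember it (e.g. via the super blob or an xattr
 * naming scheme) in order to find the blob again later. */
static void
create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
        if (bserrno != 0) {
                return; /* creation failed */
        }
        /* save 'blobid' in the application context (cb_arg) */
}

static void
make_blob(struct spdk_blob_store *bs, void *app_ctx)
{
        spdk_bs_create_blob(bs, create_complete, app_ctx);
}
~~~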
## Design Considerations {#blob_pg_design}
### Initialization Options
When the Blobstore is initialized, there are multiple configuration options to consider. The
options and their defaults are:
* **Cluster Size**: By default, this value is 1MB. The cluster size is required to be a multiple of page size and should be
selected based on the application's usage model in terms of allocation. Recall that blobs are made up of clusters, so when
a blob is allocated/deallocated or changes in size, disk LBAs will be manipulated in groups of cluster size. If the
application is expecting to deal with mainly very large (always multiple GB) blobs then it may make sense to change the
cluster size to 1GB, for example.
* **Number of Metadata Pages**: By default, Blobstore will assume there can be as many clusters as there are metadata
pages, which is the worst case scenario in terms of metadata usage. This can be overridden here; however, the space
efficiency gain is not significant.
* **Maximum Simultaneous Metadata Operations**: Determines how many internally pre-allocated memory structures are set
aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
* **Maximum Simultaneous Operations Per Channel**: Determines how many internally pre-allocated memory structures are set
aside for channel operations. Changes to this value would be application dependent and best determined by both a knowledge
of the typical usage model, an understanding of the types of SSDs being used and empirical data. The default is 512.
* **Blobstore Type**: This field is a character array to be used by applications that need to identify whether the
Blobstore found here is appropriate to claim or not. The default is NULL and unless the application is being deployed in
an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore, there
is no need to set this value. It can, however, be set to any valid set of characters.
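These options are carried in `struct spdk_bs_opts` and passed to `spdk_bs_init()`. A hedged sketch (the single-argument `spdk_bs_opts_init()` shown here matches releases of this era; newer ones also take a size argument, and `init_with_options`/`init_complete` are hypothetical names):

~~~{.c}
#include <stdio.h>
#include "spdk/blob.h"

static void
init_with_options(struct spdk_bs_dev *bs_dev, void *app_ctx)
{
        struct spdk_bs_opts opts;

        spdk_bs_opts_init(&opts);          /* start from the defaults */
        opts.cluster_sz = 4 * 1024 * 1024; /* e.g. 4MiB clusters      */
        snprintf(opts.bstype.bstype, sizeof(opts.bstype.bstype), "MYAPP");

        spdk_bs_init(bs_dev, &opts, init_complete, app_ctx);
}
~~~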
### Sub-page Sized Operations
Blobstore is only capable of doing page-sized read/write operations. If the application
requires finer granularity, it will have to accommodate that itself.
### Threads
As mentioned earlier, Blobstore can share a single thread with an application, or the application
can define any number of threads, within resource constraints, that makes sense. The basic considerations that must be
followed are:
* Metadata operations (API with MD in the name) should be isolated from each other as there is no internal locking on the
memory structures affected by these API.
* Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO would be one that is
reading/writing to an area of a blob that a metadata operation is deallocating).
* Asynchronous callbacks will always take place on the calling thread.
* No assumptions about IO ordering can be made regardless of how many or which threads were involved in issuing them.
### Data Buffer Memory
As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated
via the SPDK API, for example with `spdk_dma_malloc()`.
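For example, a hedged sketch of allocating a DMA-safe page buffer and writing it to a blob (`spdk_dma_malloc()` is from `spdk/env.h`; the write call is `spdk_blob_io_write()` in recent releases and was `spdk_bs_io_write_blob()` in the era of this commit, with offset and length specified in pages):

~~~{.c}
#include <errno.h>
#include <string.h>
#include "spdk/env.h"
#include "spdk/blob.h"

static void
write_one_page(struct spdk_blob *blob, struct spdk_io_channel *channel,
               spdk_blob_op_complete cb_fn, void *cb_arg)
{
        /* Pinned, DMA-safe, 4KiB-aligned buffer from the SPDK allocator. */
        uint8_t *buf = spdk_dma_malloc(4096, 4096, NULL);

        if (buf == NULL) {
                cb_fn(cb_arg, -ENOMEM);
                return;
        }
        memset(buf, 0x5a, 4096);

        /* Write one page at page offset 0 of the blob; free the buffer
         * with spdk_dma_free() once cb_fn reports completion. */
        spdk_blob_io_write(blob, channel, buf, 0, 1, cb_fn, cb_arg);
}
~~~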
### Error Handling
Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values
indicate an error. Synchronous calls will typically return an error value if applicable.
### Asynchronous API
Asynchronous calls return control not immediately, but at the point in execution where no
more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of
an asynchronous call until the callback has completed.
### Xattrs
Setting and removing of xattrs in Blobstore is a metadata operation; xattrs are stored in per blob metadata.
Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a separate step for
persisting per blob metadata allows applications to perform batches of xattr updates, for example, with only one
more expensive call to synchronize and persist the values.
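A sketch of batching xattr updates behind one sync (recent API names shown; older releases used `spdk_blob_md_set_xattr()` and `spdk_bs_md_sync_blob()`, and `tag_blob`/`sync_complete` are hypothetical names):

~~~{.c}
#include "spdk/blob.h"

static void
sync_complete(void *cb_arg, int bserrno)
{
        /* Both xattrs staged below are now durable (or bserrno says why not). */
}

static void
tag_blob(struct spdk_blob *blob)
{
        /* These calls only touch the in-memory copy of the metadata... */
        spdk_blob_set_xattr(blob, "name", "fast-data", sizeof("fast-data"));
        spdk_blob_set_xattr(blob, "owner", "myapp", sizeof("myapp"));

        /* ...and one sync persists both updates atomically. */
        spdk_blob_sync_md(blob, sync_complete, NULL);
}
~~~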
### Synchronizing Metadata
As described earlier, there are two types of metadata in Blobstore, per blob and one global
metadata for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global
metadata will be inconsistent during run-time and only synchronized on proper shutdown. The implication, however, of
an improper shutdown is only a performance penalty on the next startup as the global metadata will need to be rebuilt
based on a parsing of the per blob metadata. For consistent startup times, it is important to always close down the
Blobstore properly via API.
### Iterating Blobs
Multiple examples of how to iterate through the blobs are included in the sample code and tools.
It is worth noting, however, that when walking through the existing blobs via the iter API, if your application finds the
blob it is looking for, it will either need to explicitly close it (because it was opened internally by the Blobstore) or
complete walking the full list.
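A sketch of a full walk (iterator calls per `blob.h`; `spdk_bs_iter_next()` closes the blob it is handed before opening the next one, so only an early exit needs an explicit close, and the callback names are hypothetical):

~~~{.c}
#include <errno.h>
#include "spdk/blob.h"

static void
iter_complete(void *cb_arg, struct spdk_blob *blob, int bserrno)
{
        struct spdk_blob_store *bs = cb_arg;

        if (bserrno == -ENOENT) {
                return; /* walked off the end of the blob list */
        }
        if (bserrno != 0) {
                return; /* a real error */
        }
        /* Inspect 'blob' here. To stop early, close it explicitly with
         * spdk_blob_close(); otherwise continue the walk. */
        spdk_bs_iter_next(bs, blob, iter_complete, bs);
}

static void
walk_blobs(struct spdk_blob_store *bs)
{
        spdk_bs_iter_first(bs, iter_complete, bs);
}
~~~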
### The Super Blob
The super blob is simply a single blob ID that can be stored as part of the global metadata to act
as sort of a "root" blob. The application may choose to use this blob to store any information that it needs or finds
relevant in understanding any kind of structure for what is on the Blobstore.
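A sketch of recording and later retrieving the super blob (calls per `blob.h`; `root_id` and both completion callbacks are hypothetical names):

~~~{.c}
#include "spdk/blob.h"

static void
set_super_complete(void *cb_arg, int bserrno)
{
        /* the root blob ID is now recorded in the Blobstore metadata */
}

static void
get_super_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
        /* 'blobid' is the previously recorded root blob, or bserrno
         * reports that none was ever set */
}

/* At setup time: record the application's "root" blob. */
static void
remember_root(struct spdk_blob_store *bs, spdk_blob_id root_id)
{
        spdk_bs_set_super(bs, root_id, set_super_complete, NULL);
}

/* After a reload: ask the Blobstore for the recorded ID. */
static void
find_root(struct spdk_blob_store *bs)
{
        spdk_bs_get_super(bs, get_super_complete, NULL);
}
~~~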
## Examples {#blob_pg_examples}
There are multiple examples of Blobstore usage in the [repo](https://github.com/spdk/spdk):
* **Hello World**: Actually named `hello_blob.c`, this is a very basic example of a single threaded application that
does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses
a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end
is a `bdev` module, thus this example uses not only the SPDK Framework but the `bdev` layer as well.
* **Hello NVME Blob**: `hello_nvme_blob.c` is the non-bdev version of `hello_blob.c` and simply shows how an
application can directly integrate Blobstore with the SPDK NVMe driver without using the `bdev` layer at all.
* **CLI**: The `blobcli.c` example is a command line utility intended to not only serve as example code but as a test
and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the
SPDK Framework and the `bdev` layer but offers multiple modes of operation to accomplish some real-world tasks. In
command mode, it accepts single-shot commands which can be a little time consuming if there are many commands to
get through, as each one will take a few seconds waiting for DPDK initialization. It therefore has a shell mode that
allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore with simple commands
that include the ability to import/export blobs from/to regular files. Lastly there is a scripting mode to automate
a series of tasks, again, handy for development and/or test type activities.
## Configuration {#blob_pg_config}
Blobstore configuration options are described in the initialization options section under @ref blob_pg_design.
## Component Detail {#blob_pg_component}
The information in this section is not necessarily relevant to designing an application for use with Blobstore, but
understanding a little more about the internals may be interesting; it is also included here for those wanting to
contribute to the Blobstore effort itself.
### Media Format
The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such
that cluster 0 begins at the first logical block.
~~~
LBA 0                                   LBA N
+-----------+-----------+-----+-----------+
| Cluster 0 | Cluster 1 | ... | Cluster N |
+-----------+-----------+-----+-----------+
~~~
Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:
~~~
+--------+-------------------+
| Page 0 | Page 1 ... Page N |
+--------+-------------------+
| Super  |  Metadata Region  |
| Block  |                   |
+--------+-------------------+
~~~
The super block is a single page located at the beginning of the partition. It contains basic information about
the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer
to the latest source code for complete structural details of the super block and metadata region.
Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages
form a linked list. The first page in the list will be written in place on update, while all other pages will
be written to fresh locations. This requires the backing device to support an atomic write size greater than
or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.
### Sequences and Batches
Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either
a serial fashion or in parallel, respectively. Both are defined using the following structure:
~~~{.c}
struct spdk_bs_request_set;
~~~
These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups
of IO. They are an internal construct only and are pre-allocated on a per channel basis (channels were discussed
earlier). They are removed from a channel-associated linked list when the set (sequence or batch) is started and
then returned to the list when completed.
### Key Internal Structures
`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones
are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines
the public API is `blob.h`.
~~~{.c}
struct spdk_blob
~~~
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two
copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written
to disk.
~~~{.c}
struct spdk_blob_mut_data
~~~
This is a per blob structure, included in the `struct spdk_blob` struct that actually defines the blob itself. It has the
specific information on the size and makeup of the blob (i.e. how many clusters are allocated for this blob and which ones).
~~~{.c}
struct spdk_blob_store
~~~
This is the main in-memory structure for the entire Blobstore. It defines the global on disk metadata region and maintains
information relevant to the entire system - initialization options such as cluster size, etc.
~~~{.c}
struct spdk_bs_super_block
~~~
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore
structure just discussed along with other elements one would expect to see here such as signature, version, checksum, etc.
### Code Layout and Common Conventions
In general, `blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For
example,
~~~{.c}
/* START spdk_bs_md_delete_blob */
< relevant functions to accomplish the deletion of a blob >
/* END spdk_bs_md_delete_blob */
~~~
And for the most part the following conventions are followed throughout:
* functions beginning with an underscore are called internally only
* functions or variables with the letters `cpl` are related to set or callback completions

- @ref blobfs
- @ref jsonrpc
# Programmer Guides {#prog_guides}
- @ref blob
# General Information {#general}
- @ref bdev_pg
- @ref bdev_module

doc/template_pg.md (new file)
# ComponentName Programmer's Guide {#componentname_pg}
## In this document {#componentname_pg_toc}

* @ref componentname_pg_audience
* @ref componentname_pg_intro
* @ref componentname_pg_theory
* @ref componentname_pg_design
* @ref componentname_pg_examples
* @ref componentname_pg_config
* @ref componentname_pg_component
* @ref componentname_pg_sequences
## Target Audience {#componentname_pg_audience}
This programmer's guide is intended for developers authoring applications that utilize the SPDK <COMPONENT NAME>. It is
intended to supplement the source code to provide an overall understanding of how to integrate <COMPONENT NAME> into
an application as well as provide some high level insight into how <COMPONENT NAME> works behind the scenes. It is not
intended to serve as a design document or an API reference but in some cases source code snippets and high level
sequences will be discussed. For the latest source code reference refer to the [repo](https://github.com/spdk).
## Introduction {#componentname_pg_intro}
Provide some high level description of what this component is, what it does and maybe why it exists. This shouldn't be
a lengthy tutorial or commentary on storage in general or the goodness of SPDK but provide enough information to
set the stage for someone about to write an application to integrate with this component. They won't be totally
starting from scratch if they're at this point; they are by definition a storage application developer if they are
reading this guide.
## Theory of Operation {#componentname_pg_theory}
Create subsections here to drill down into "how" this component works. This isn't a design section, however, so
avoid getting into too many details; just hit the high level concepts that would leave the developer with a
50K foot overview of the major elements/assumptions/concepts that they should have some baseline knowledge about before
they start writing code.
Some questions to consider when authoring this section:
* What are the basic primitives that this component exposes?
* How are these primitives related to one another?
* What are the threading rules when using these primitives?
* What are the theoretical performance implications for different scaling vectors?
* Are there any other documents or specifications that the user should be familiar with?
* What are the intended use cases?
## Design Considerations {#componentname_pg_design}
Here is where you want to highlight things they need to think about in *their* design. If you have written test code
for this module think about the things that you needed to go learn about to properly interact with this module. Think
about how they need to consider initialization options, threading, limitations, any sort of quirky or non-obvious
interactions or module behaviors that might save them some time and effort by thinking about before they start their
design.
## Examples {#componentname_pg_examples}
List all of the relevant examples we have in the repo that use this module and describe a little about what they do.
## Configuration {#componentname_pg_config}
This section should describe the mechanisms for configuring the component at a high level (i.e. you can configure it
using a config file, or you can configure it using RPC calls over a unix domain socket). It should also talk about
when you can configure it - i.e. at run time or only up front. For specifics about how the RPCs work or the config
file format, link to the appropriate user guide instead of putting that information here.
## Component Detail {#componentname_pg_component}
This is where we can provide some design level detail if it makes sense for this module. We don't want to have
design docs as part of SPDK, the overhead and maintenance is too much for open source. We do, however, want
to provide some level of insight into the codebase to promote getting more people involved and understanding
of what the design is all about. The PG is meant to help a developer write their own application but we
can use this section, per module, to test out a way to build out some internal design info as well. I see
this as including an overview of key structures, concepts, etc., of the module itself. So, interesting info
not required to write an application using the module but maybe just enough to provide the next level of
detail into what's behind the scenes to get someone more interested in becoming a community contributor.
## Sequences {#componentname_pg_sequences}
If sequence diagrams make sense for this module, use mscgen to create simple UML-style diagrams (they don't need to be
100% UML compliant) for the API that an application needs to interact with. Details internal to the component
should not be included.