From fcb333e5b539e9dd541f1f355db01babab61a0ee Mon Sep 17 00:00:00 2001
From: Ben Walker
Date: Mon, 20 Mar 2017 14:21:12 -0700
Subject: [PATCH] blob: Add a design document

This outlines the basic concepts in the blobstore.

Change-Id: Ib44c3bd5c04ddb0cc06eeb0e50ae6ff1bdf94dd5
Signed-off-by: Ben Walker
---
 doc/Doxyfile      |   1 +
 doc/blob/index.md | 277 ++++++++++++++++++++++++++++++++++++++++++++++
 doc/index.md      |   1 +
 3 files changed, 279 insertions(+)
 create mode 100644 doc/blob/index.md

diff --git a/doc/Doxyfile b/doc/Doxyfile
index 9ba9aa281..76f03812a 100644
--- a/doc/Doxyfile
+++ b/doc/Doxyfile
@@ -764,6 +764,7 @@ INPUT = ../include/spdk \
     porting.md \
     bdev/index.md \
     bdev/getting_started.md \
+    blob/index.md \
     blobfs/index.md \
     blobfs/getting_started.md \
     event/index.md \

diff --git a/doc/blob/index.md b/doc/blob/index.md
new file mode 100644
index 000000000..71f20416f
--- /dev/null
+++ b/doc/blob/index.md
@@ -0,0 +1,277 @@
# Blobstore {#blob}

## Introduction

The blobstore is a persistent, power-fail safe block allocator designed to be
used as the local storage system backing a higher level storage service,
typically in lieu of a traditional filesystem. These higher level services can
be local databases or key/value stores (MySQL, RocksDB), dedicated appliances
(SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not
designed to be a general purpose filesystem, however, and it is intentionally
not POSIX compliant. To avoid confusion, no reference is made to files or
objects at all; the term 'blob' is used instead. The blobstore is designed to
allow asynchronous, uncached, parallel reads and writes to groups of blocks on
a block device; these groups of blocks are called 'blobs'. Blobs are typically
large, measured in at least hundreds of kilobytes, and are always a multiple of
the underlying block size.

The blobstore is designed primarily to run on "next generation" media, which
means the device supports fast random reads _and_ writes, with no required
background garbage collection. However, in practice the design will run well on
NAND too. Absolutely no attempt will be made to make this efficient on spinning
media.

## Design Goals

The blobstore is intended to solve a number of problems that local databases
have when using traditional POSIX filesystems. These databases are assumed to
'own' the entire storage device, to not need to track access times, and to
require only a very simple directory hierarchy. These assumptions allow
significant design optimizations over a traditional POSIX filesystem and block
stack.

Asynchronous I/O can be an order of magnitude or more faster than synchronous
I/O, and so solutions like
[libaio](https://git.fedorahosted.org/cgit/libaio.git/) have become popular.
However, libaio is [not actually
asynchronous](http://www.scylladb.com/2016/02/09/qualifying-filesystems/) in
all cases. The blobstore will provide truly asynchronous operations in all
cases without any hidden locks or stalls.

With the advent of NVMe, storage devices now have a hardware interface that
allows for highly parallel I/O submission from many threads with no locks.
Unfortunately, placement of data on a device requires some central coordination
to avoid conflicts. The blobstore will separate operations that require
coordination from operations that do not, and allow users to explicitly
associate I/O with channels. Operations on different channels happen in
parallel, all the way down to the hardware, with no locks or coordination.
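As an illustration of this threading model, the sketch below shows a thread
allocating its own channel and submitting a write on it. The declarations here
are hypothetical placeholders, not the blobstore's actual API; only the shape
of the model (one channel per thread, lock-free submission, callback-based
completion) is the point.

    #include <stdint.h>

    /*
     * Hypothetical placeholder declarations - not the real blobstore API.
     * They exist only to show the intended threading model.
     */
    struct blob_store;
    struct blob;
    struct blob_channel;
    typedef void (*blob_op_complete)(void *cb_arg, int rc);

    struct blob_channel *blob_channel_alloc(struct blob_store *bs);
    int blob_write(struct blob *blob, struct blob_channel *channel,
                   void *payload, uint64_t page_offset, uint64_t page_count,
                   blob_op_complete cb_fn, void *cb_arg);

    static void write_done(void *cb_arg, int rc)
    {
        /* Completions are delivered on the thread that owns the channel. */
    }

    /* Each thread allocates its own channel once and submits I/O on it. */
    static void worker_thread_write(struct blob_store *bs, struct blob *blob,
                                    void *payload, uint64_t page_offset,
                                    uint64_t page_count)
    {
        struct blob_channel *channel = blob_channel_alloc(bs);

        /*
         * No locks are taken on this path; I/O on this channel proceeds in
         * parallel with I/O submitted by other threads on their channels.
         */
        blob_write(blob, channel, payload, page_offset, page_count,
                   write_done, NULL);
    }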
As media access latency improves, strategies for in-memory caching are changing
and often the kernel page cache is a bottleneck. Many databases have moved to
opening files only in O_DIRECT mode, avoiding the page cache entirely, and
writing their own caching layer. With the introduction of next generation media
and its additional expected latency reductions, this strategy will become far
more prevalent. To support this, the blobstore will perform no in-memory
caching of data at all, essentially making all blob operations conceptually
equivalent to O_DIRECT. This means the blobstore has restrictions similar to
O_DIRECT: data can only be read or written in units of pages (4KiB), although
the memory alignment requirements are much less strict than with O_DIRECT (a
page may even be composed of scattered buffers). We fully expect that DRAM
caching will remain critical to performance, but leave the specifics of the
cache design to higher layers.

Storage devices pull data from host memory using a DMA engine, and those DMA
engines operate on physical addresses and often introduce alignment
restrictions. Further, to avoid data corruption, the data must not be paged out
by the operating system while it is being transferred to disk. Traditionally,
operating systems solve this problem either by copying user data into special
kernel buffers allocated for this purpose and performing the I/O to and from
there, or by taking locks that mark all of the user's pages as pinned and
unmovable. Historically, the time to perform the copy or locking was
inconsequential relative to the I/O time at the storage device, but that is
simply no longer the case. The blobstore will instead provide zero copy,
lockless read and write access to the device. To do this, memory to be used for
blob data must be registered with the blobstore up front, preferably at
application start and out of the I/O path, so that it can be pinned, the
physical addresses can be determined, and the alignment requirements can be
verified.

Hardware devices are necessarily limited to some maximum queue depth. For NVMe
devices that can be quite large (the specification allows up to 64K!), but it
is typically much smaller (128 - 1024 entries per queue). Under heavy load,
databases may generate enough requests to exceed the hardware queue depth,
which requires queueing in software. For operating systems this is often done
in the generic block layer and may cause unexpected stalls or require locks.
The blobstore will avoid this by simply failing requests with an appropriate
error code when the queue is full. This allows the blobstore to easily stick to
its commitment to never block, but may require the user to provide their own
queueing layer.

The NVMe specification has added support for specifying priorities on the
hardware queues. With a traditional filesystem and storage stack, however,
there is no reasonable way to map an I/O from an arbitrary thread to a
particular hardware queue to be processed with the priority requested. The
blobstore solves this by allowing the user to create channels with priorities,
which map directly to priorities on NVMe hardware queues. The user can then
choose the priority for an I/O by sending it on the appropriate channel. This
is incredibly useful for many databases where data intake operations need to
run with a much higher priority than background scrub and compaction operations
in order to stay within quality of service requirements. Note that many NVMe
devices today do not yet support queue priorities, so the blobstore considers
this feature optional.
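Because, as described above, requests that would exceed the hardware queue
depth fail immediately rather than block, an application that can outrun the
device needs a small queueing layer of its own. The sketch below shows one way
to structure that; the function names and the specific error value (-ENOMEM)
are assumptions for illustration, not the blobstore's defined interface.

    #include <errno.h>
    #include <stdint.h>

    /* Hypothetical placeholders matching the earlier sketch - not the real API. */
    struct blob;
    struct blob_channel;
    typedef void (*blob_op_complete)(void *cb_arg, int rc);
    int blob_write(struct blob *blob, struct blob_channel *channel,
                   void *payload, uint64_t page_offset, uint64_t page_count,
                   blob_op_complete cb_fn, void *cb_arg);

    struct pending_write {
        struct pending_write *next;
        struct blob          *blob;
        void                 *payload;
        uint64_t              page_offset;
        uint64_t              page_count;
    };

    /* Per-thread overflow list; no locking needed since channels are per-thread. */
    static struct pending_write *overflow_list;

    static void write_done(void *cb_arg, int rc)
    {
        /*
         * A completion frees a queue slot, so a real application would now
         * resubmit the head of overflow_list, if any.
         */
    }

    static void submit_or_queue(struct blob_channel *channel,
                                struct pending_write *w)
    {
        int rc = blob_write(w->blob, channel, w->payload, w->page_offset,
                            w->page_count, write_done, w);

        if (rc == -ENOMEM) {
            /* Queue full: park the request rather than block the thread. */
            w->next = overflow_list;
            overflow_list = w;
        }
    }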
## The Basics

The blobstore defines a hierarchy of three units of disk space. The smallest
are the *logical blocks* exposed by the disk itself, which are numbered from 0
to N, where N is the number of blocks in the disk. A logical block is typically
either 512B or 4KiB.

The blobstore defines a *page* to be a fixed number of logical blocks defined
at blobstore creation time. The logical blocks that compose a page are
contiguous. Pages are also numbered from the beginning of the disk such that
the first page worth of blocks is page 0, the second page is page 1, etc. A
page is typically 4KiB in size, so in practice a page is either 1 or 8 logical
blocks. The device must be able to perform atomic reads and writes of at least
the page size.

The largest unit is a *cluster*, which is a fixed number of pages defined at
blobstore creation time. The pages that compose a cluster are contiguous.
Clusters are also numbered from the beginning of the disk, where cluster 0 is
the first cluster worth of pages, cluster 1 is the second grouping of pages,
etc. A cluster is typically 1MiB in size, or 256 pages.

On top of these three basic units, the blobstore defines three primitives. The
most fundamental is the blob, where a blob is an ordered list of clusters plus
an identifier. Blobs persist across power failures and reboots. The set of all
blobs described by shared metadata is called the blobstore. I/O operations on
blobs are submitted through a channel. Channels are tied to threads, but
multiple threads can simultaneously submit I/O operations to the same blob on
their own channels.

Blobs are read and written in units of pages by specifying an offset in the
virtual blob address space. This offset is translated by first determining
which cluster(s) are being accessed, and then translating to a set of logical
blocks. This translation is done trivially using only basic math - there is no
mapping data structure (see the sketch at the end of this section). Unlike read
and write, blobs are resized in units of clusters.

Blobs are described by their metadata, which consists of a discontiguous set of
pages stored in a reserved region on the disk. Each page of metadata is
referred to as a *metadata page*. Blobs do not share metadata pages with other
blobs, and in fact the design relies on the backing storage device supporting
an atomic write unit greater than or equal to the page size. Most devices
backed by NAND and next generation media support this atomic write capability,
but often magnetic media does not.

The metadata region is fixed in size and defined upon creation of the
blobstore. The size is configurable, but by default one page is allocated for
each cluster. For 1MiB clusters and 4KiB pages, that results in about 0.4%
metadata overhead.
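To make the translation concrete, the sketch below maps a page offset within a
blob to a starting LBA on the device, using the typical geometry above (512B
logical blocks, 4KiB pages, 1MiB clusters). The blob structure here is a
hypothetical stand-in for whatever holds the blob's ordered cluster list; the
point is that the lookup is a divide, a modulo, and one array index.

    #include <assert.h>
    #include <stdint.h>

    /* Typical geometry from the text above. */
    #define BLOCK_SIZE        512u
    #define PAGE_SIZE         4096u
    #define CLUSTER_SIZE      (1024u * 1024u)
    #define BLOCKS_PER_PAGE   (PAGE_SIZE / BLOCK_SIZE)     /* 8   */
    #define PAGES_PER_CLUSTER (CLUSTER_SIZE / PAGE_SIZE)   /* 256 */

    /* Hypothetical stand-in for a blob: just its ordered list of clusters. */
    struct example_blob {
        uint64_t  num_clusters;
        uint64_t *clusters; /* clusters[i] = physical cluster backing virtual cluster i */
    };

    /* Translate a page offset within the blob to the starting LBA on disk. */
    static uint64_t blob_page_to_lba(const struct example_blob *blob,
                                     uint64_t page)
    {
        uint64_t virtual_cluster  = page / PAGES_PER_CLUSTER;
        uint64_t page_in_cluster  = page % PAGES_PER_CLUSTER;
        uint64_t physical_cluster;

        assert(virtual_cluster < blob->num_clusters);
        physical_cluster = blob->clusters[virtual_cluster];

        /* No per-page mapping structure: basic math plus one array lookup. */
        return physical_cluster * PAGES_PER_CLUSTER * BLOCKS_PER_PAGE +
               page_in_cluster * BLOCKS_PER_PAGE;
    }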
## Conventions

Data formats on the device are specified in [Backus-Naur
Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form). All data is
stored on media in little-endian format. Unspecified data must be zeroed.

## Media Format

The blobstore owns the entire storage device. The device is divided into
clusters starting from the beginning, such that cluster 0 begins at the first
logical block.

    LBA 0                                   LBA N
    +-----------+-----------+-----+-----------+
    | Cluster 0 | Cluster 1 | ... | Cluster N |
    +-----------+-----------+-----+-----------+

Or in formal notation:

    <media-format> ::= <cluster0> <cluster>*

Cluster 0 is special and has the following format, where page 0
is the first page of the cluster:

    +--------+-------------------+
    | Page 0 | Page 1 ... Page N |
    +--------+-------------------+
    | Super  |  Metadata Region  |
    | Block  |                   |
    +--------+-------------------+

Or formally:

    <cluster0> ::= <super-block> <metadata-region>

The super block is a single page located at the beginning of the partition.
It contains basic information about the blobstore. The metadata region
is the remainder of cluster 0 and may extend to additional clusters.

    <super-block> ::= <sb-version> <sb-len> <sb-super-blob> <sb-params>
                      <sb-metadata-start> <sb-metadata-len>
    <sb-version> ::= u32
    <sb-len> ::= u32 # Length of this super block, in bytes. Starts from the
                     # beginning of this structure.
    <sb-super-blob> ::= u64 # Special blobid set by the user that indicates
                            # where their starting metadata resides.

    <sb-metadata-start> ::= u64 # Metadata start location, in pages
    <sb-metadata-len> ::= u64 # Metadata length, in pages

The `<sb-params>` data contains parameters specified by the user when the blob
store was initially formatted.

    <sb-params> ::= <sb-page-size> <sb-cluster-size>
    <sb-page-size> ::= u32 # Page size, in bytes.
                           # Must be a multiple of the logical block size.
                           # The implementation today requires this to be 4KiB.
    <sb-cluster-size> ::= u32 # Cluster size, in bytes.
                              # Must be a multiple of the page size.

Each blob is allocated a non-contiguous set of pages inside the metadata region
for its metadata. These pages form a linked list. The first page in the list
will be written in place on update, while all other pages will be written to
fresh locations. This requires the backing device to support an atomic write
size greater than or equal to the page size to guarantee that the operation is
atomic. See the section on atomicity for details.

Each page is defined as:

    <metadata-page> ::= <blob-id> <blob-sequence-num> <descriptor>*
                        <metadata-next> <metadata-crc>

    <blob-id> ::= u64 # The blob guid
    <blob-sequence-num> ::= u32 # The sequence number of this page in the
                                # linked list.

    <descriptor> ::= <descriptor-type> <descriptor-length> <descriptor-data>

    <descriptor-type> ::= u8 # 0 means padding, 1 means "extent", 2 means
                             # xattr. The type describes how to interpret
                             # the descriptor data.
    <descriptor-length> ::= u32 # Length of the entire descriptor

    <descriptor-data> ::= u8*

    <extent> ::= <extent-cluster-id> <extent-cluster-count>
    <extent-cluster-id> ::= u32 # The cluster id where this extent starts
    <extent-cluster-count> ::= u32 # The number of clusters in this extent

    <xattr> ::= <xattr-name-length> <xattr-value-length> <xattr-name>
                <xattr-value>

    <xattr-name-length> ::= u16
    <xattr-value-length> ::= u16
    <xattr-name> ::= u8*
    <xattr-value> ::= u8*

    <metadata-next> ::= u32 # The offset into the metadata region that contains
                            # the next page of metadata. 0 means no next page.
    <metadata-crc> ::= u32 # CRC of the entire page

Descriptors cannot span metadata pages.
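To make the layout concrete, here is one possible C rendering of the super
block described by the grammar above. The struct name, field names, and exact
packing are illustrative assumptions; the field widths and meanings follow the
grammar, and all values are stored little-endian as noted in the Conventions
section.

    #include <stdint.h>

    /* Illustrative only - an assumed C view of the super block grammar above. */
    #pragma pack(push, 1)
    struct example_super_block {
        uint32_t version;        /* <sb-version> */
        uint32_t length;         /* <sb-len>: length of the super block, bytes */
        uint64_t super_blob;     /* <sb-super-blob>: user's starting blobid */
        uint32_t page_size;      /* <sb-page-size>: in bytes, currently 4KiB */
        uint32_t cluster_size;   /* <sb-cluster-size>: in bytes */
        uint64_t metadata_start; /* <sb-metadata-start>: in pages */
        uint64_t metadata_len;   /* <sb-metadata-len>: in pages */
        /* The rest of the super block page is unspecified and must be zeroed. */
    };
    #pragma pack(pop)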
## Atomicity

Metadata in the blobstore is cached in memory and must be explicitly synced by
the user. Data is not cached, however, so when a write completes the data can
be considered durable as long as the metadata it depends on has been
synchronized. Metadata does not often change, and in fact only needs to be
synchronized after these explicit operations:

* resize
* set xattr
* remove xattr

Any other operation will not dirty the metadata. Further, the metadata for each
blob is independent of all of the others, so a synchronization operation is
only needed on the specific blob that is dirty.

The metadata consists of a linked list of pages. Updates to the metadata are
done by first writing pages 2 through N to new locations, then writing page 1
in place to atomically update the chain, and finally erasing the remainder of
the old chain. The vast majority of the time, blobs consist of just a single
metadata page and so this operation is very efficient. For this scheme to work
the write to the first page must be atomic, which requires hardware support
from the backing device. For most, if not all, NVMe SSDs, an atomic write unit
of 4KiB can be expected. Devices specify their atomic write unit in their NVMe
identify data - specifically in the AWUN field.

diff --git a/doc/index.md b/doc/index.md
index 76df9e6ff..26b7f8353 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -25,4 +25,5 @@ which avoids kernel context switches and eliminates interrupt handling overhead.
 - @ref ioat
 - @ref iscsi
 - @ref bdev
+- @ref blob
 - @ref blobfs