From fcb333e5b539e9dd541f1f355db01babab61a0ee Mon Sep 17 00:00:00 2001
From: Ben Walker
Date: Mon, 20 Mar 2017 14:21:12 -0700
Subject: [PATCH] blob: Add a design document

This outlines the basic concepts in the blobstore.

Change-Id: Ib44c3bd5c04ddb0cc06eeb0e50ae6ff1bdf94dd5
Signed-off-by: Ben Walker
---
 doc/Doxyfile      |   1 +
 doc/blob/index.md | 277 ++++++++++++++++++++++++++++++++++++++++++++++
 doc/index.md      |   1 +
 3 files changed, 279 insertions(+)
 create mode 100644 doc/blob/index.md

diff --git a/doc/Doxyfile b/doc/Doxyfile
index 9ba9aa281..76f03812a 100644
--- a/doc/Doxyfile
+++ b/doc/Doxyfile
@@ -764,6 +764,7 @@ INPUT = ../include/spdk \
     porting.md \
     bdev/index.md \
     bdev/getting_started.md \
+    blob/index.md \
     blobfs/index.md \
     blobfs/getting_started.md \
     event/index.md \

diff --git a/doc/blob/index.md b/doc/blob/index.md
new file mode 100644
index 000000000..71f20416f
--- /dev/null
+++ b/doc/blob/index.md
@@ -0,0 +1,277 @@
# Blobstore {#blob}

## Introduction

The blobstore is a persistent, power-fail safe block allocator designed to be
used as the local storage system backing a higher level storage service,
typically in lieu of a traditional filesystem. These higher level services can
be local databases or key/value stores (MySQL, RocksDB), dedicated appliances
(SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not
designed to be a general purpose filesystem, however, and it is intentionally
not POSIX compliant. To avoid confusion, no reference is made to files or
objects at all; the term 'blob' is used instead. The blobstore is designed to
allow asynchronous, uncached, parallel reads and writes to groups of blocks on
a block device; these groups of blocks are called 'blobs'. Blobs are typically
large, measured in at least hundreds of kilobytes, and are always a multiple of
the underlying block size.

The blobstore is designed primarily to run on "next generation" media, which
means the device supports fast random reads _and_ writes, with no required
background garbage collection. However, in practice the design will run well on
NAND too. Absolutely no attempt will be made to make this efficient on spinning
media.

## Design Goals

The blobstore is intended to solve a number of problems that local databases
have when using traditional POSIX filesystems. These databases are assumed to
'own' the entire storage device, to not need to track access times, and to
require only a very simple directory hierarchy. These assumptions allow
significant design optimizations over a traditional POSIX filesystem and block
stack.

Asynchronous I/O can be an order of magnitude or more faster than synchronous
I/O, and so solutions like
[libaio](https://git.fedorahosted.org/cgit/libaio.git/) have become popular.
However, libaio is [not actually
asynchronous](http://www.scylladb.com/2016/02/09/qualifying-filesystems/) in
all cases. The blobstore will provide truly asynchronous operations in all
cases without any hidden locks or stalls.

With the advent of NVMe, storage devices now have a hardware interface that
allows for highly parallel I/O submission from many threads with no locks.
Unfortunately, placement of data on a device requires some central coordination
to avoid conflicts. The blobstore will separate operations that require
coordination from operations that do not, and allow users to explicitly
associate I/O with channels. Operations on different channels happen in
parallel, all the way down to the hardware, with no locks or coordination.
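As an illustration of this threading model, the sketch below shows a thread
allocating its own channel and submitting a write on it. The declarations here
are hypothetical placeholders, not the blobstore's actual API; only the shape
of the model (one channel per thread, lock-free submission, callback-based
completion) is the point.

    #include <stdint.h>

    /*
     * Hypothetical placeholder declarations - not the real blobstore API.
     * They exist only to show the intended threading model.
     */
    struct blob_store;
    struct blob;
    struct blob_channel;
    typedef void (*blob_op_complete)(void *cb_arg, int rc);

    struct blob_channel *blob_channel_alloc(struct blob_store *bs);
    int blob_write(struct blob *blob, struct blob_channel *channel,
                   void *payload, uint64_t page_offset, uint64_t page_count,
                   blob_op_complete cb_fn, void *cb_arg);

    static void write_done(void *cb_arg, int rc)
    {
        /* Completions are delivered on the thread that owns the channel. */
    }

    /* Each thread allocates its own channel once and submits I/O on it. */
    static void worker_thread_write(struct blob_store *bs, struct blob *blob,
                                    void *payload, uint64_t page_offset,
                                    uint64_t page_count)
    {
        struct blob_channel *channel = blob_channel_alloc(bs);

        /*
         * No locks are taken on this path; I/O on this channel proceeds in
         * parallel with I/O submitted by other threads on their channels.
         */
        blob_write(blob, channel, payload, page_offset, page_count,
                   write_done, NULL);
    }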
As media access latency improves, strategies for in-memory caching are changing
and often the kernel page cache is a bottleneck. Many databases have moved to
opening files only in O_DIRECT mode, avoiding the page cache entirely, and
writing their own caching layer. With the introduction of next generation media
and its additional expected latency reductions, this strategy will become far
more prevalent. To support this, the blobstore will perform no in-memory
caching of data at all, essentially making all blob operations conceptually
equivalent to O_DIRECT. This means the blobstore has restrictions similar to
O_DIRECT: data can only be read or written in units of pages (4KiB), although
the memory alignment requirements are much less strict than with O_DIRECT (a
page may even be composed of scattered buffers). We fully expect that DRAM
caching will remain critical to performance, but leave the specifics of the
cache design to higher layers.

Storage devices pull data from host memory using a DMA engine, and those DMA
engines operate on physical addresses and often introduce alignment
restrictions. Further, to avoid data corruption, the data must not be paged out
by the operating system while it is being transferred to disk. Traditionally,
operating systems solve this problem either by copying user data into special
kernel buffers allocated for this purpose and performing the I/O to and from
there, or by taking locks that mark all of the user's pages as pinned and
unmovable. Historically, the time to perform the copy or locking was
inconsequential relative to the I/O time at the storage device, but that is
simply no longer the case. The blobstore will instead provide zero copy,
lockless read and write access to the device. To do this, memory to be used for
blob data must be registered with the blobstore up front, preferably at
application start and out of the I/O path, so that it can be pinned, the
physical addresses can be determined, and the alignment requirements can be
verified.

Hardware devices are necessarily limited to some maximum queue depth. For NVMe
devices that can be quite large (the specification allows up to 64K!), but it
is typically much smaller (128 - 1024 entries per queue). Under heavy load,
databases may generate enough requests to exceed the hardware queue depth,
which requires queueing in software. For operating systems this is often done
in the generic block layer and may cause unexpected stalls or require locks.
The blobstore will avoid this by simply failing requests with an appropriate
error code when the queue is full. This allows the blobstore to easily stick to
its commitment to never block, but may require the user to provide their own
queueing layer.

The NVMe specification has added support for specifying priorities on the
hardware queues. With a traditional filesystem and storage stack, however,
there is no reasonable way to map an I/O from an arbitrary thread to a
particular hardware queue to be processed with the priority requested. The
blobstore solves this by allowing the user to create channels with priorities,
which map directly to priorities on NVMe hardware queues. The user can then
choose the priority for an I/O by sending it on the appropriate channel. This
is incredibly useful for many databases where data intake operations need to
run with a much higher priority than background scrub and compaction operations
in order to stay within quality of service requirements. Note that many NVMe
devices today do not yet support queue priorities, so the blobstore considers
this feature optional.
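Because, as described above, requests that would exceed the hardware queue
depth fail immediately rather than block, an application that can outrun the
device needs a small queueing layer of its own. The sketch below shows one way
to structure that; the function names and the specific error value (-ENOMEM)
are assumptions for illustration, not the blobstore's defined interface.

    #include <errno.h>
    #include <stdint.h>

    /* Hypothetical placeholders matching the earlier sketch - not the real API. */
    struct blob;
    struct blob_channel;
    typedef void (*blob_op_complete)(void *cb_arg, int rc);
    int blob_write(struct blob *blob, struct blob_channel *channel,
                   void *payload, uint64_t page_offset, uint64_t page_count,
                   blob_op_complete cb_fn, void *cb_arg);

    struct pending_write {
        struct pending_write *next;
        struct blob          *blob;
        void                 *payload;
        uint64_t              page_offset;
        uint64_t              page_count;
    };

    /* Per-thread overflow list; no locking needed since channels are per-thread. */
    static struct pending_write *overflow_list;

    static void write_done(void *cb_arg, int rc)
    {
        /*
         * A completion frees a queue slot, so a real application would now
         * resubmit the head of overflow_list, if any.
         */
    }

    static void submit_or_queue(struct blob_channel *channel,
                                struct pending_write *w)
    {
        int rc = blob_write(w->blob, channel, w->payload, w->page_offset,
                            w->page_count, write_done, w);

        if (rc == -ENOMEM) {
            /* Queue full: park the request rather than block the thread. */
            w->next = overflow_list;
            overflow_list = w;
        }
    }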
## The Basics

The blobstore defines a hierarchy of three units of disk space. The smallest
are the *logical blocks* exposed by the disk itself, which are numbered from 0
to N, where N is the number of blocks in the disk. A logical block is typically
either 512B or 4KiB.

The blobstore defines a *page* to be a fixed number of logical blocks defined
at blobstore creation time. The logical blocks that compose a page are
contiguous. Pages are also numbered from the beginning of the disk such that
the first page worth of blocks is page 0, the second page is page 1, etc. A
page is typically 4KiB in size, so in practice a page is either 1 or 8 logical
blocks. The device must be able to perform atomic reads and writes of at least
the page size.

The largest unit is a *cluster*, which is a fixed number of pages defined at
blobstore creation time. The pages that compose a cluster are contiguous.
Clusters are also numbered from the beginning of the disk, where cluster 0 is
the first cluster worth of pages, cluster 1 is the second grouping of pages,
etc. A cluster is typically 1MiB in size, or 256 pages.

On top of these three basic units, the blobstore defines three primitives. The
most fundamental is the blob, where a blob is an ordered list of clusters plus
an identifier. Blobs persist across power failures and reboots. The set of all
blobs described by shared metadata is called the blobstore. I/O operations on
blobs are submitted through a channel. Channels are tied to threads, but
multiple threads can simultaneously submit I/O operations to the same blob on
their own channels.

Blobs are read and written in units of pages by specifying an offset in the
virtual blob address space. This offset is translated by first determining
which cluster(s) are being accessed, and then translating to a set of logical
blocks. This translation is done trivially using only basic math - there is no
mapping data structure (see the sketch at the end of this section). Unlike read
and write, blobs are resized in units of clusters.

Blobs are described by their metadata, which consists of a discontiguous set of
pages stored in a reserved region on the disk. Each page of metadata is
referred to as a *metadata page*. Blobs do not share metadata pages with other
blobs, and in fact the design relies on the backing storage device supporting
an atomic write unit greater than or equal to the page size. Most devices
backed by NAND and next generation media support this atomic write capability,
but often magnetic media does not.

The metadata region is fixed in size and defined upon creation of the
blobstore. The size is configurable, but by default one page is allocated for
each cluster. For 1MiB clusters and 4KiB pages, that results in about 0.4%
metadata overhead.
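To make the translation concrete, the sketch below maps a page offset within a
blob to a starting LBA on the device, using the typical geometry above (512B
logical blocks, 4KiB pages, 1MiB clusters). The blob structure here is a
hypothetical stand-in for whatever holds the blob's ordered cluster list; the
point is that the lookup is a divide, a modulo, and one array index.

    #include <assert.h>
    #include <stdint.h>

    /* Typical geometry from the text above. */
    #define BLOCK_SIZE        512u
    #define PAGE_SIZE         4096u
    #define CLUSTER_SIZE      (1024u * 1024u)
    #define BLOCKS_PER_PAGE   (PAGE_SIZE / BLOCK_SIZE)     /* 8   */
    #define PAGES_PER_CLUSTER (CLUSTER_SIZE / PAGE_SIZE)   /* 256 */

    /* Hypothetical stand-in for a blob: just its ordered list of clusters. */
    struct example_blob {
        uint64_t  num_clusters;
        uint64_t *clusters; /* clusters[i] = physical cluster backing virtual cluster i */
    };

    /* Translate a page offset within the blob to the starting LBA on disk. */
    static uint64_t blob_page_to_lba(const struct example_blob *blob,
                                     uint64_t page)
    {
        uint64_t virtual_cluster  = page / PAGES_PER_CLUSTER;
        uint64_t page_in_cluster  = page % PAGES_PER_CLUSTER;
        uint64_t physical_cluster;

        assert(virtual_cluster < blob->num_clusters);
        physical_cluster = blob->clusters[virtual_cluster];

        /* No per-page mapping structure: basic math plus one array lookup. */
        return physical_cluster * PAGES_PER_CLUSTER * BLOCKS_PER_PAGE +
               page_in_cluster * BLOCKS_PER_PAGE;
    }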
## Conventions

Data formats on the device are specified in [Backus-Naur
Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form). All data is
stored on media in little-endian format. Unspecified data must be zeroed.

## Media Format

The blobstore owns the entire storage device. The device is divided into
clusters starting from the beginning, such that cluster 0 begins at the first
logical block.

    LBA 0                                   LBA N
    +-----------+-----------+-----+-----------+
    | Cluster 0 | Cluster 1 | ... | Cluster N |
    +-----------+-----------+-----+-----------+

Or in formal notation:

    <media-format> ::= <cluster0> <cluster>*

Cluster 0 is special and has the following format, where page 0
is the first page of the cluster:

    +--------+-------------------+
    | Page 0 | Page 1 ... Page N |
    +--------+-------------------+
    | Super  |  Metadata Region  |
    | Block  |                   |
    +--------+-------------------+

Or formally:

    <cluster0> ::= <super-block> <metadata-region>

The super block is a single page located at the beginning of the partition.
It contains basic information about the blobstore. The metadata region
is the remainder of cluster 0 and may extend to additional clusters.

    <super-block> ::= <sb-version> <sb-len> <sb-super-blob> <sb-params>
                      <sb-metadata-start> <sb-metadata-len>
    <sb-version> ::= u32
    <sb-len> ::= u32 # Length of this super block, in bytes. Starts from the
                     # beginning of this structure.
    <sb-super-blob> ::= u64 # Special blobid set by the user that indicates
                            # where their starting metadata resides.

    <sb-metadata-start> ::= u64 # Metadata start location, in pages
    <sb-metadata-len> ::= u64 # Metadata length, in pages

The `<sb-params>` data contains parameters specified by the user when the blob
store was initially formatted.

    <sb-params> ::= <sb-page-size> <sb-cluster-size>
    <sb-page-size> ::= u32 # Page size, in bytes.
                           # Must be a multiple of the logical block size.
                           # The implementation today requires this to be 4KiB.
    <sb-cluster-size> ::= u32 # Cluster size, in bytes.
                              # Must be a multiple of the page size.

Each blob is allocated a non-contiguous set of pages inside the metadata region
for its metadata. These pages form a linked list. The first page in the list
will be written in place on update, while all other pages will be written to
fresh locations. This requires the backing device to support an atomic write
size greater than or equal to the page size to guarantee that the operation is
atomic. See the section on atomicity for details.

Each page is defined as:

    <metadata-page> ::= <blob-id> <blob-sequence-num> <descriptor>*
                        <metadata-next> <metadata-crc>

    <blob-id> ::= u64 # The blob guid
    <blob-sequence-num> ::= u32 # The sequence number of this page in the
                                # linked list.

    <descriptor> ::= <descriptor-type> <descriptor-length> <descriptor-data>

    <descriptor-type> ::= u8 # 0 means padding, 1 means "extent", 2 means
                             # xattr. The type describes how to interpret
                             # the descriptor data.
    <descriptor-length> ::= u32 # Length of the entire descriptor

    <descriptor-data> ::= u8*

    <extent> ::= <extent-cluster-id> <extent-cluster-count>
    <extent-cluster-id> ::= u32 # The cluster id where this extent starts
    <extent-cluster-count> ::= u32 # The number of clusters in this extent

    <xattr> ::= <xattr-name-length> <xattr-value-length> <xattr-name>
                <xattr-value>

    <xattr-name-length> ::= u16
    <xattr-value-length> ::= u16
    <xattr-name> ::= u8*
    <xattr-value> ::= u8*

    <metadata-next> ::= u32 # The offset into the metadata region that contains
                            # the next page of metadata. 0 means no next page.
    <metadata-crc> ::= u32 # CRC of the entire page

Descriptors cannot span metadata pages.
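To make the layout concrete, here is one possible C rendering of the super
block described by the grammar above. The struct name, field names, and exact
packing are illustrative assumptions; the field widths and meanings follow the
grammar, and all values are stored little-endian as noted in the Conventions
section.

    #include <stdint.h>

    /* Illustrative only - an assumed C view of the super block grammar above. */
    #pragma pack(push, 1)
    struct example_super_block {
        uint32_t version;        /* <sb-version> */
        uint32_t length;         /* <sb-len>: length of the super block, bytes */
        uint64_t super_blob;     /* <sb-super-blob>: user's starting blobid */
        uint32_t page_size;      /* <sb-page-size>: in bytes, currently 4KiB */
        uint32_t cluster_size;   /* <sb-cluster-size>: in bytes */
        uint64_t metadata_start; /* <sb-metadata-start>: in pages */
        uint64_t metadata_len;   /* <sb-metadata-len>: in pages */
        /* The rest of the super block page is unspecified and must be zeroed. */
    };
    #pragma pack(pop)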
## Atomicity

Metadata in the blobstore is cached in memory and must be explicitly synced by
the user. Data is not cached, however, so when a write completes the data can
be considered durable as long as the metadata it depends on has been
synchronized. Metadata does not often change, and in fact only needs to be
synchronized after these explicit operations:

* resize
* set xattr
* remove xattr

Any other operation will not dirty the metadata. Further, the metadata for each
blob is independent of all of the others, so a synchronization operation is
only needed on the specific blob that is dirty.

The metadata consists of a linked list of pages. Updates to the metadata are
done by first writing pages 2 through N to new locations, then writing page 1
in place to atomically update the chain, and finally erasing the remainder of
the old chain. The vast majority of the time, blobs consist of just a single
metadata page and so this operation is very efficient. For this scheme to work
the write to the first page must be atomic, which requires hardware support
from the backing device. For most, if not all, NVMe SSDs, an atomic write unit
of 4KiB can be expected. Devices specify their atomic write unit in their NVMe
identify data - specifically in the AWUN field.

diff --git a/doc/index.md b/doc/index.md
index 76df9e6ff..26b7f8353 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -25,4 +25,5 @@ which avoids kernel context switches and eliminates interrupt handling overhead.
 - @ref ioat
 - @ref iscsi
 - @ref bdev
+- @ref blob
 - @ref blobfs