doc/blob: thin provisioning, snapshots, and clones

Describe how blobstores handle thin provisioning, snapshots, and clones.

Signed-off-by: Mike Gerdts <mgerdts@nvidia.com>
Change-Id: Ie6f1b69799a404a373269986fe8a89c36f381620
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11270
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2021-12-21 20:45:26 +00:00 · 45e0a2a370
commit 45e0a2a370
parent fa272c9bc6
1 changed files with 127 additions and 0 deletions
--- a/doc/blob.md
+++ b/doc/blob.md
@@ -302,6 +302,133 @@ when creating a blob.
  Extents pointing to contiguous LBA are run-length encoded, including unallocated extents represented by 0.
  Every new cluster allocation incurs serializing whole linked list of pages for the blob.
 ### Thin Blobs, Snapshots, and Clones
 Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
 allocated to a blob it is considered owned by that blob and that particular blob's metadata
 maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
 during snapshot operations described later in @ref blob_pg_snapshots.
 Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
 owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends
 on whether the operation targets blocks that are backed by a cluster owned by the blob or not.
 * **read from blocks on an owned cluster**: The read is serviced by reading directly from the
  appropriate cluster.
 * **read from other blocks**: The read is passed on to the blob's *back device* and the back
  device services the read. The back device may be another blob or it may be a zeroes device.
 * **write to blocks on an owned cluster**: The write is serviced by writing directly to the
  appropriate cluster.
 * **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
  is allocated to the blob the process described in @ref blob_pg_thin_provisioning is followed.
 * **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
  for details.
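
The cases above can be sketched as a small classifier. This is a toy model written for illustration, not SPDK code; the function `classify_io` and the `ZEROES` marker are hypothetical names that do not exist in the blobstore API.

```python
ZEROES = "zeroes"  # hypothetical marker for the zeroes back device


def classify_io(op, cluster_owned, back_device):
    """Return which of the cases described above applies.

    op            -- "read" or "write"
    cluster_owned -- True if the blob owns the cluster backing the target blocks
    back_device   -- ZEROES, or the name of the blob acting as back device
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if cluster_owned:
        # Reads and writes to owned clusters go straight to the cluster.
        return f"{op} directly on owned cluster"
    if op == "read":
        # Unowned reads are always serviced by the back device.
        return f"read passed to back device ({back_device})"
    if back_device == ZEROES:
        # Unowned write with nothing to copy: allocate, then write.
        return "thin provisioning allocation"
    # Unowned write over data held by another blob.
    return "copy-on-write"
```

For example, `classify_io("write", False, "snap1")` selects the copy-on-write path, while the same write against the zeroes device only needs an allocation.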
 #### Thin Provisioning {#blob_pg_thin_provisioning}
 As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
 starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
 blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
 zeroes.
 When a write to a thin provisioned blob targets a block that does not have an allocated cluster, the
 following steps are performed:
 1. Allocate a cluster.
 2. Update blob metadata.
 3. Perform the write.
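
A minimal in-memory sketch of these three steps follows. It assumes a tiny cluster size and models metadata as a Python list; the class `ThinBlob` and the constant `CLUSTER_SIZE` are hypothetical and only illustrate the ordering of the steps.

```python
CLUSTER_SIZE = 4  # blocks per cluster; tiny value chosen for illustration


class ThinBlob:
    def __init__(self, num_clusters):
        # None means "no cluster allocated"; reads fall through to the zeroes device.
        self.clusters = [None] * num_clusters

    def read(self, block):
        cluster = self.clusters[block // CLUSTER_SIZE]
        if cluster is None:
            return 0  # the zeroes device fills the buffer with zeroes
        return cluster[block % CLUSTER_SIZE]

    def write(self, block, value):
        idx = block // CLUSTER_SIZE
        if self.clusters[idx] is None:
            new_cluster = [0] * CLUSTER_SIZE  # 1. allocate a cluster
            self.clusters[idx] = new_cluster  # 2. update blob metadata
        self.clusters[idx][block % CLUSTER_SIZE] = value  # 3. perform the write

    def allocated(self):
        # Number of clusters currently owned by the blob.
        return sum(c is not None for c in self.clusters)
```

A freshly created `ThinBlob(4)` owns no clusters; a single write allocates exactly one cluster and leaves the rest backed by zeroes.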
 #### Snapshots and Clones {#blob_pg_snapshots}
 A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
 blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
 the covers this results in a chain of snapshots that are clones of the previous snapshot.
 When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
 blob. That is:
 | Step | Action                         | State                                             |
 | ---- | ------------------------------ | ------------------------------------------------- |
 | 1    | Create blob1                   | `blob1 (rw)`                                      |
 | 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`                       |
 | 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`                       |
 | 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) --> blob2 (ro)`        |
 Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
 full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
 transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
 one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
 allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
 becomes blob3.
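
The ownership transfers in the table can be traced with a small model. This is a sketch under simplifying assumptions: clusters are dictionary entries, the copy-on-write in step 2a is reduced to a plain assignment, and the names `Blob`, `snapshot`, and `chain` are hypothetical.

```python
class Blob:
    def __init__(self, name, clusters=None, read_only=False, back=None):
        self.name = name
        self.clusters = dict(clusters or {})  # cluster index -> data owned by this blob
        self.read_only = read_only
        self.back = back  # back device: another Blob, or None for no blob behind it


def snapshot(blob, snap_name):
    """Create a read-only snapshot of blob; blob becomes a clone of it."""
    snap = Blob(snap_name, clusters=blob.clusters, read_only=True, back=blob.back)
    blob.clusters = {}  # ownership of every cluster moves to the snapshot
    blob.back = snap    # the snapshot is now the blob's back device
    return snap


def chain(blob):
    """Render the chain of back devices starting at blob."""
    names = []
    while blob is not None:
        names.append(blob.name)
        blob = blob.back
    return " --> ".join(names)


blob1 = Blob("blob1", clusters={0: "a", 1: "b"})  # step 1, thick provisioned
blob2 = snapshot(blob1, "blob2")                  # step 2
blob1.clusters[0] = "A"                           # step 2a, copy-on-write elided
blob3 = snapshot(blob1, "blob3")                  # step 3
```

After step 3, `blob3` owns only the cluster written in step 2a, `blob2` still owns the original clusters, and `blob1` owns none.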
 It is important to understand the chain above when considering strategies to use a golden image from
 which many clones are made. The IO path is more efficient if one snapshot is cloned many times than
 it is to create a new snapshot for every clone. The following illustrates the difference.
 Using a single snapshot means the data originally referenced by the golden image is always one hop
 away.
 ```text
 create golden                           golden --> golden-snap
 snapshot golden as golden-snap                     ^ ^ ^
 clone golden-snap as clone1              clone1 ---+ | |
 clone golden-snap as clone2              clone2 -----+ |
 clone golden-snap as clone3              clone3 -------+
 ```
 Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
 clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from
 clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the
 blocks originally allocated to golden).
 ```text
 create golden
 snapshot golden as snap1                golden --> snap3 -----> snap2 ----> snap1
 clone snap1 as clone1                   clone3----/   clone2 --/  clone1 --/
 snapshot golden as snap2
 clone snap2 as clone2
 snapshot golden as snap3
 clone snap3 as clone3
 ```
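
The difference between the two layouts can be quantified by counting back-device hops. The sketch below encodes both diagrams as plain dictionaries mapping each blob to its back device; `hops_to_owner` is a hypothetical helper, not part of the blobstore.

```python
def hops_to_owner(back_of, start, owner):
    """Count back-device hops from start until owner is reached."""
    hops, current = 0, start
    while current != owner:
        current = back_of[current]  # raises KeyError if owner is not in the chain
        hops += 1
    return hops


# One snapshot cloned three times, as in the first diagram.
single = {"golden": "golden-snap", "clone1": "golden-snap",
          "clone2": "golden-snap", "clone3": "golden-snap"}

# One snapshot per clone, as in the second diagram.
per_clone = {"golden": "snap3", "snap3": "snap2", "snap2": "snap1",
             "clone1": "snap1", "clone2": "snap2", "clone3": "snap3"}
```

With a single snapshot every clone is one hop from the original data. With a snapshot per clone, the newest clone is as many hops away as there are snapshots.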
 A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
 the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
 freed, depending on whether the clone already owns a cluster for a particular block range.
 Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
 can serve as the snapshot for future clones.
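
The transfer-or-free rule for deleting a snapshot with one clone can be sketched as a dictionary merge. The function `delete_snapshot` is hypothetical, and cluster maps are simplified to `cluster index -> cluster id`.

```python
def delete_snapshot(snap_clusters, clone_clusters):
    """Merge a deleted snapshot's clusters into its single clone.

    Both arguments map cluster index -> cluster id. Returns
    (the clone's new cluster map, list of freed cluster ids).
    """
    merged = dict(clone_clusters)
    freed = []
    for index, cluster in snap_clusters.items():
        if index in merged:
            freed.append(cluster)    # clone already owns this range: free the old cluster
        else:
            merged[index] = cluster  # ownership is transferred to the clone
    return merged, freed
```

For a snapshot owning clusters `{0: 10, 1: 11, 2: 12}` and a clone owning `{1: 20}`, the clone ends up with all three ranges and only cluster 11 is freed.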
 #### Inflating and Decoupling Clones
 A clone can remove its dependence on a snapshot with one of the following operations:
 1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
   allocated clusters. The blob becomes a thick provisioned blob.
 2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
   allocated clusters. If the clone's back device snapshot was itself a clone of another
   snapshot, the clone remains a clone but is now a clone of a different snapshot.
 3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
   clone remains a clone, but is now a clone of a different snapshot.
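
The difference between inflating and decoupling can be illustrated with layered dictionaries, where each layer maps a cluster index to the data owned by one blob. This is a rough model only; `resolve`, `inflate`, and `decouple` are hypothetical helpers, and `decouple` assumes the clone has at least one ancestor.

```python
def resolve(layers, index):
    """Return data for a cluster index from the first layer that owns it, else zeroes."""
    for layer in layers:
        if index in layer:
            return layer[index]
    return 0  # zeroes device


def inflate(clone, ancestors, num_clusters):
    """Copy every cluster the clone does not own; no back device remains."""
    full = {i: resolve([clone] + ancestors, i) for i in range(num_clusters)}
    return full, []


def decouple(clone, ancestors):
    """Copy only clusters owned by the immediate parent; keep older ancestors."""
    parent, rest = ancestors[0], ancestors[1:]
    merged = dict(parent)
    merged.update(clone)  # the clone's own clusters take precedence
    return merged, rest
```

Both functions return the clone's new cluster map and its remaining chain of ancestors, nearest first. After `inflate` the chain is empty and every cluster is allocated. After `decouple` the chain is one element shorter and clusters that were never written stay unallocated.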
 #### Copy-on-write {#blob_pg_copy_on_write}
 A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
 size. A typical copy-on-write involves the following steps:
 1. Allocate a cluster.
 2. Allocate a cluster-sized buffer into which data can be read.
 3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
 4. Write from the cluster-sized buffer into the newly allocated cluster.
 5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
   involves at least one page-sized write.
 6. Write the new data to the just allocated and copied cluster.
 If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
 the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
 offloaded to the device.
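
The numbered steps can be sketched in memory as follows. The function `copy_on_write`, its parameters, and the tiny `CLUSTER_SIZE` are assumptions made for illustration; device-side copy offload is not modeled.

```python
CLUSTER_SIZE = 4  # blocks per cluster; illustrative only


def copy_on_write(owned, back_read, index, offset, value, back_is_zeroes=False):
    """Apply a write to a cluster that the blob does not yet own.

    owned     -- dict: cluster index -> list of blocks owned by the blob
    back_read -- callable(index) returning the back device's cluster contents
    """
    new_cluster = [0] * CLUSTER_SIZE  # 1. allocate a cluster
    if not back_is_zeroes:
        buf = list(back_read(index))  # 2-3. allocate a buffer, read the full cluster
        new_cluster[:] = buf          # 4. write the buffer into the new cluster
    owned[index] = new_cluster        # 5. record ownership in the blob's metadata
    new_cluster[offset] = value       # 6. write the new data
    return owned
```

Writing one block to an unowned cluster copies the remaining blocks from the back device and leaves the back device untouched. With `back_is_zeroes=True` the buffer and copy steps are skipped, matching the shortcut described above.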
 ### Sequences and Batches
 Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either