From 45e0a2a370bd7c7e7894a09d9092fb8969b67991 Mon Sep 17 00:00:00 2001
From: Mike Gerdts <mgerdts@nvidia.com>
Date: Tue, 21 Dec 2021 20:45:26 +0000
Subject: [PATCH] doc/blob: thin provisioning, snapshots, and clones

Describe how blobstores handle thin provisioning, snapshots, and clones.

Signed-off-by: Mike Gerdts <mgerdts@nvidia.com>
Change-Id: Ie6f1b69799a404a373269986fe8a89c36f381620
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11270
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
---
 doc/blob.md | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/doc/blob.md b/doc/blob.md
index dc24ecd40..542b294b7 100644
--- a/doc/blob.md
+++ b/doc/blob.md
@@ -302,6 +302,133 @@ when creating a blob.
   Extents pointing to contiguous LBA are run-length encoded, including unallocated extents represented by 0.
   Every new cluster allocation incurs serializing whole linked list of pages for the blob.
 
+### Thin Blobs, Snapshots, and Clones
+
+Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
+allocated to a blob it is considered owned by that blob and that particular blob's metadata
+maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
+during snapshot operations described later in @ref blob_pg_snapshots.
+
+Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
+owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes
+depends on whether the operation targets blocks that are backed by a cluster owned by the blob.
+
+* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
+  appropriate cluster.
+* **read from other blocks**: The read is passed on to the blob's *back device* and the back
+  device services the read. The back device may be another blob or it may be a zeroes device.
+* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
+  appropriate cluster.
+* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
+  is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
+* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
+  for details.
+
+#### Thin Provisioning {#blob_pg_thin_provisioning}
+
+As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
+starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
+blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
+zeroes.
+
+When a write to a thin provisioned blob targets a block that does not have an allocated cluster,
+the following steps are performed:
+
+1. Allocate a cluster.
+2. Update blob metadata.
+3. Perform the write.
+
+#### Snapshots and Clones {#blob_pg_snapshots}
+
+A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
+blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
+the covers this results in a chain of snapshots, each of which is a clone of the previous snapshot.
+
+When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
+blob. That is:
+
+| Step | Action                         | State                                             |
+| ---- | ------------------------------ | ------------------------------------------------- |
+| 1    | Create blob1                   | `blob1 (rw)`                                      |
+| 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`                       |
+| 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`                       |
+| 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) --> blob2 (ro)`        |
+
+Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
+full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
+transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
+one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
+allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
+becomes blob3.
+
+It is important to understand the chain above when considering strategies to use a golden image from
+which many clones are made. The IO path is more efficient when one snapshot is cloned many times
+than when a new snapshot is created for every clone. The following illustrates the difference.
+
+Using a single snapshot means the data originally referenced by the golden image is always one hop
+away.
+
+```text
+create golden                           golden --> golden-snap
+snapshot golden as golden-snap                     ^ ^ ^
+clone golden-snap as clone1              clone1 ---+ | |
+clone golden-snap as clone2              clone2 -----+ |
+clone golden-snap as clone3              clone3 -------+
+```
+
+Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
+clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from
+clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the
+blocks originally allocated to golden).
+
+```text
+create golden
+snapshot golden as snap1                golden --> snap3 -----> snap2 ----> snap1
+clone snap1 as clone1                   clone3 ---/   clone2 --/  clone1 --/
+snapshot golden as snap2
+clone snap2 as clone2
+snapshot golden as snap3
+clone snap3 as clone3
+```
+
+A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
+the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
+freed, depending on whether the clone already owns a cluster for a particular block range.
+
+Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
+can serve as the snapshot for future clones.
+
+#### Inflating and Decoupling Clones
+
+A clone can remove its dependence on a snapshot with one of the following operations:
+
+1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
+   allocated clusters. The blob becomes a thick provisioned blob.
+2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
+   allocated clusters. If the clone's back device snapshot was itself a clone of another
+   snapshot, the clone remains a clone but is now a clone of a different snapshot.
+3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
+   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
+   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
+   clone remains a clone, but is now a clone of a different snapshot.
+
+#### Copy-on-write {#blob_pg_copy_on_write}
+
+A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
+size. A typical copy-on-write involves the following steps:
+
+1. Allocate a cluster.
+2. Allocate a cluster-sized buffer into which data can be read.
+3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
+4. Write from the cluster-sized buffer into the newly allocated cluster.
+5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
+   involves at least one page-sized write.
+6. Write the new data to the just allocated and copied cluster.
+
+If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
+the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
+offloaded to the device.
+
 ### Sequences and Batches
 
 Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either