doc/blob: thin provisioning, snapshots, and clones

Describe how blobstores handle thin provisioning, snapshots, and clones.

Signed-off-by: Mike Gerdts <mgerdts@nvidia.com>
Change-Id: Ie6f1b69799a404a373269986fe8a89c36f381620
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11270
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>

@@ -302,6 +302,133 @@ when creating a blob.
Extents pointing to contiguous LBA are run-length encoded, including unallocated extents represented by 0.
Every new cluster allocation incurs serializing the whole linked list of pages for the blob.

### Thin Blobs, Snapshots, and Clones

Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
allocated to a blob it is considered owned by that blob and that blob's metadata maintains a
reference to the cluster as a record of ownership. Cluster ownership is transferred during snapshot
operations described later in @ref blob_pg_snapshots.

Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes
depends on whether the operation targets blocks that are backed by a cluster owned by the blob or
not, as summarized in the following list and illustrated by the sketch after it.

* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
appropriate cluster.
* **read from other blocks**: The read is passed on to the blob's *back device* and the back
device services the read. The back device may be another blob or it may be a zeroes device.
* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
appropriate cluster.
* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
for details.
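
From the application's point of view these cases are handled transparently: the same blob I/O calls
are used whether the target blocks are backed by an owned cluster, another blob, or the zeroes
device. The following is a minimal sketch, assuming an already opened blob, its blob store, and an
allocated I/O channel; exact signatures can vary between SPDK releases, so consult
`include/spdk/blob.h`.

```c
#include "spdk/blob.h"
#include "spdk/env.h"

static void
write_done(void *cb_arg, int bserrno)
{
        /* bserrno == 0 on success; the write buffer is passed as cb_arg. */
        spdk_dma_free(cb_arg);
}

/*
 * Write one io_unit at offset 0. Blobstore decides internally whether this
 * lands on an owned cluster, triggers a thin provisioning allocation, or
 * triggers copy-on-write. Reads use spdk_blob_io_read() the same way.
 */
static void
write_first_io_unit(struct spdk_blob *blob, struct spdk_blob_store *bs,
                    struct spdk_io_channel *ch)
{
        uint64_t io_unit_size = spdk_bs_get_io_unit_size(bs);
        void *buf = spdk_dma_zmalloc(io_unit_size, 0x1000, NULL);

        /* Offset and length are expressed in io_units, not bytes. */
        spdk_blob_io_write(blob, ch, buf, 0, 1, write_done, buf);
}
```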

#### Thin Provisioning {#blob_pg_thin_provisioning}

As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
zeroes.
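
As a minimal sketch of how such a blob is created, thin provisioning is requested at creation time
by setting `thin_provision` in `spdk_blob_opts`. The two-argument `spdk_blob_opts_init()` shown here
is the newer form; older releases take only the opts pointer.

```c
#include "spdk/blob.h"

static void
blob_created(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
        /* On success, blobid identifies the new thin provisioned blob. */
}

static void
create_thin_blob(struct spdk_blob_store *bs)
{
        struct spdk_blob_opts opts;

        spdk_blob_opts_init(&opts, sizeof(opts));
        opts.num_clusters = 256;      /* logical size in clusters */
        opts.thin_provision = true;   /* back the blob with the zeroes device */

        spdk_bs_create_blob_ext(bs, &opts, blob_created, NULL);
}
```

Reads from the new blob return zeroes until clusters are allocated by writes.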

When a write to a thin provisioned blob targets a block that does not have an allocated cluster,
the following steps are performed:

1. Allocate a cluster.
2. Update blob metadata.
3. Perform the write.

#### Snapshots and Clones {#blob_pg_snapshots}

A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
the covers this results in a chain of snapshots that are clones of the previous snapshot.

When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
blob. That is:

| Step | Action | State |
| ---- | ------------------------------ | ------------------------------------------------- |
| 1 | Create blob1 | `blob1 (rw)` |
| 2 | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)` |
| 2a | Write to blob1 | `blob1 (rw) --> blob2 (ro)` |
| 3 | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) ---> blob2 (ro)` |

Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1
cause one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
becomes blob3.
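
The following sketch mirrors steps 2 and 3 of the table, assuming blob1 already exists; error
handling is omitted. The resulting chain can be walked with `spdk_blob_get_parent_snapshot()`.

```c
#include <stdlib.h>

#include "spdk/blob.h"

struct snap_ctx {
        struct spdk_blob_store *bs;
        spdk_blob_id blob1_id;
};

static void
second_snapshot_created(void *cb_arg, spdk_blob_id blob3_id, int bserrno)
{
        /* The chain is now blob1 (rw) --> blob3 (ro) --> blob2 (ro). */
        free(cb_arg);
}

static void
first_snapshot_created(void *cb_arg, spdk_blob_id blob2_id, int bserrno)
{
        struct snap_ctx *ctx = cb_arg;

        /* blob1 is now a clone of blob2 (step 2). Step 3: snapshot blob1 again. */
        spdk_bs_create_snapshot(ctx->bs, ctx->blob1_id, NULL,
                                second_snapshot_created, ctx);
}

static void
snapshot_twice(struct spdk_blob_store *bs, spdk_blob_id blob1_id)
{
        struct snap_ctx *ctx = calloc(1, sizeof(*ctx));

        ctx->bs = bs;
        ctx->blob1_id = blob1_id;
        /* Step 2: create blob2; blob1 becomes a clone of blob2. */
        spdk_bs_create_snapshot(bs, blob1_id, NULL, first_snapshot_created, ctx);
}
```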

It is important to understand the chain above when considering strategies for using a golden image
from which many clones are made. The IO path is more efficient when one snapshot is cloned many
times than when a new snapshot is created for every clone. The following illustrates the difference.

Using a single snapshot means the data originally referenced by the golden image is always one hop
away.

```text
create golden                         golden --> golden-snap
snapshot golden as golden-snap                    ^  ^  ^
clone golden-snap as clone1           clone1 -----+  |  |
clone golden-snap as clone2           clone2 --------+  |
clone golden-snap as clone3           clone3 -----------+
```

Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), then
from snap3's back device (snap2), and finally from snap2's back device (snap1, the current owner of
the blocks originally allocated to golden).

```text
create golden
snapshot golden as snap1    golden --> snap3 -----> snap2 ----> snap1
clone snap1 as clone1        clone3 ----/  clone2 --/  clone1 --/
snapshot golden as snap2
clone snap2 as clone2
snapshot golden as snap3
clone snap3 as clone3
```
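
A sketch of the single-snapshot pattern, assuming `golden_snap_id` was returned by an earlier
`spdk_bs_create_snapshot()` call. For brevity the clones are created in a loop; production code
would normally wait for each completion before issuing the next request.

```c
#include "spdk/blob.h"

static void
clone_created(void *cb_arg, spdk_blob_id clone_id, int bserrno)
{
        /* clone_id identifies a new writable clone of golden-snap. */
}

static void
clone_golden(struct spdk_blob_store *bs, spdk_blob_id golden_snap_id, int count)
{
        for (int i = 0; i < count; i++) {
                /* Every clone shares golden-snap as its back device. */
                spdk_bs_create_clone(bs, golden_snap_id, NULL, clone_created, NULL);
        }
}
```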

A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
freed, depending on whether the clone already owns a cluster for a particular block range.

Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
can serve as the snapshot for future clones.

#### Inflating and Decoupling Clones

A clone can remove its dependence on a snapshot with the following operations (an API sketch follows
the list):

1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
allocated clusters. The blob becomes a thick provisioned blob.
2. Decouple the clone. Clusters backed by the clone's immediate back device snapshot are copied
into newly allocated clusters. If that snapshot was itself a clone of another snapshot, the
clone remains a clone but is now a clone of that other snapshot.
3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
clone remains a clone, but is now a clone of a different snapshot.
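
A sketch of options 1 and 2, assuming an I/O channel obtained from `spdk_bs_alloc_io_channel()`;
error handling is omitted and signatures may vary between releases.

```c
#include "spdk/blob.h"

static void
op_done(void *cb_arg, int bserrno)
{
        /* bserrno == 0 on success. */
}

/* Option 1: copy data from every back device; the clone becomes thick provisioned. */
static void
inflate_clone(struct spdk_blob_store *bs, struct spdk_io_channel *channel,
              spdk_blob_id clone_id)
{
        spdk_bs_inflate_blob(bs, channel, clone_id, op_done, NULL);
}

/* Option 2: copy only the clusters backed by the immediate back device snapshot. */
static void
decouple_clone(struct spdk_blob_store *bs, struct spdk_io_channel *channel,
               spdk_blob_id clone_id)
{
        spdk_bs_blob_decouple_parent(bs, channel, clone_id, op_done, NULL);
}
```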

#### Copy-on-write {#blob_pg_copy_on_write}

A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
size. Typical copy-on-write involves the following steps:

1. Allocate a cluster.
2. Allocate a cluster-sized buffer into which data can be read.
3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
4. Write from the cluster-sized buffer into the newly allocated cluster.
5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
involves at least one page-sized write.
6. Write the new data to the just allocated and copied cluster.

If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
offloaded to the device.
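
To make the steps concrete, the following is a toy, self-contained model of a copy-on-write write
path. It is not SPDK code: the types and helpers are invented for illustration, clusters are plain
memory buffers, only one level of back device is considered, and the on-disk metadata update of
step 5 is reduced to recording a pointer.

```c
#include <stdlib.h>
#include <string.h>

#define CLUSTER_SIZE 4096

struct toy_blob {
        void **clusters;           /* NULL entry: cluster not owned by this blob */
        struct toy_blob *back;     /* back device blob, or NULL (zeroes device)  */
        size_t num_clusters;
};

static void
toy_write(struct toy_blob *blob, size_t cluster_idx, size_t off,
          const void *data, size_t len)
{
        if (blob->clusters[cluster_idx] == NULL) {
                void *new_cluster = malloc(CLUSTER_SIZE);               /* step 1 */

                if (blob->back != NULL && blob->back->clusters[cluster_idx] != NULL) {
                        /* Steps 2-4: copy the old data from the back device. */
                        memcpy(new_cluster, blob->back->clusters[cluster_idx],
                               CLUSTER_SIZE);
                } else {
                        /* Back device is the zeroes device: skip the copy. */
                        memset(new_cluster, 0, CLUSTER_SIZE);
                }
                blob->clusters[cluster_idx] = new_cluster;              /* step 5 */
        }
        memcpy((char *)blob->clusters[cluster_idx] + off, data, len);   /* step 6 */
}
```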

### Sequences and Batches

Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either