doc/blob: thin provisioning, snapshots, and clones
Describe how blobstores handle thin provisioning, snapshots, and clones.

Signed-off-by: Mike Gerdts <mgerdts@nvidia.com>
Change-Id: Ie6f1b69799a404a373269986fe8a89c36f381620
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11270
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
parent fa272c9bc6
commit 45e0a2a370
127
doc/blob.md
@@ -302,6 +302,133 @@ when creating a blob.
Extents pointing to contiguous LBA are run-length encoded, including unallocated extents represented by 0.
Every new cluster allocation incurs serializing the whole linked list of pages for the blob.

### Thin Blobs, Snapshots, and Clones

Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
allocated to a blob it is considered owned by that blob and that particular blob's metadata
maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
during snapshot operations described later in @ref blob_pg_snapshots.

Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends
on whether the operation targets blocks that are backed by a cluster owned by the blob or not.

* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
  appropriate cluster.
* **read from other blocks**: The read is passed on to the blob's *back device* and the back
  device services the read. The back device may be another blob or it may be a zeroes device.
* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
  appropriate cluster.
* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
  is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
  for details.
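
The five cases above amount to a small decision table. The following is a minimal sketch of that
table as a toy model; `Action` and `classify_io` are hypothetical names invented for illustration
and are not part of the blobstore API.

```python
from enum import Enum


class Action(Enum):
    """Possible I/O paths in this toy model (names are illustrative only)."""
    DIRECT = "service directly from the owned cluster"
    BACK_DEVICE_READ = "pass the read to the back device"
    THIN_ALLOCATE = "allocate a cluster, then write"
    COPY_ON_WRITE = "copy-on-write"


def classify_io(op, owns_cluster, back_is_zeroes):
    """Pick the path for one I/O, following the five cases listed above.

    op             -- "read" or "write"
    owns_cluster   -- True if the blob owns the cluster backing the target blocks
    back_is_zeroes -- True if the blob's back device is the zeroes device
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if owns_cluster:
        # Reads and writes to owned clusters go straight to the cluster.
        return Action.DIRECT
    if op == "read":
        # The back device may be another blob or the zeroes device.
        return Action.BACK_DEVICE_READ
    if back_is_zeroes:
        return Action.THIN_ALLOCATE
    return Action.COPY_ON_WRITE
```

For example, `classify_io("write", False, False)` selects the copy-on-write path, while the same
write against a zeroes back device selects the thin provisioning path.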

#### Thin Provisioning {#blob_pg_thin_provisioning}

As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
zeroes.

When a write to a thin provisioned blob targets a block that does not have an allocated cluster, the
following steps are performed:

1. Allocate a cluster.
2. Update blob metadata.
3. Perform the write.
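
The three steps above can be sketched with an in-memory toy model. `ThinBlob`, `cluster_map`, and
the tiny `CLUSTER_SIZE` are assumptions made for illustration; real blobstore clusters live on a
block device, and this sketch assumes a freshly allocated cluster reads back as zeroes.

```python
CLUSTER_SIZE = 4  # bytes per cluster; deliberately tiny (hypothetical value)


class ThinBlob:
    """Toy thin provisioned blob: clusters are allocated on first write."""

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        self.cluster_map = {}  # stands in for blob metadata: cluster index -> data

    def _check(self, offset, length):
        if offset < 0 or offset + length > self.num_clusters * CLUSTER_SIZE:
            raise ValueError("I/O outside the blob")

    def read(self, offset, length):
        self._check(offset, length)
        out = bytearray()
        for pos in range(offset, offset + length):
            idx, off = divmod(pos, CLUSTER_SIZE)
            cluster = self.cluster_map.get(idx)
            # Unallocated ranges are served by the zeroes back device.
            out.append(cluster[off] if cluster is not None else 0)
        return bytes(out)

    def write(self, offset, data):
        self._check(offset, len(data))
        for i, byte in enumerate(data):
            idx, off = divmod(offset + i, CLUSTER_SIZE)
            if idx not in self.cluster_map:
                cluster = bytearray(CLUSTER_SIZE)  # 1. allocate a cluster
                self.cluster_map[idx] = cluster    # 2. update blob metadata
            self.cluster_map[idx][off] = byte      # 3. perform the write
```

Writing two bytes at offset 5 of a four-cluster blob allocates only cluster 1; every other range
continues to read as zeroes.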

#### Snapshots and Clones {#blob_pg_snapshots}

A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
the covers this results in a chain of snapshots that are clones of the previous snapshot.

When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
blob. That is:

| Step | Action                         | State                                      |
| ---- | ------------------------------ | ------------------------------------------ |
| 1    | Create blob1                   | `blob1 (rw)`                               |
| 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`                |
| 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`                |
| 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) --> blob2 (ro)` |

Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
becomes blob3.
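
The ownership transfer described above can be modeled in a few lines. This is a sketch only:
`Blob`, `snapshot`, and `chain` are hypothetical names, clusters are plain dictionary entries, and
`back=None` stands in for a blob that has no snapshot behind it.

```python
class Blob:
    """Toy blob: owned clusters plus an optional back device."""

    def __init__(self, name, clusters=None, read_only=False, back=None):
        self.name = name
        self.clusters = dict(clusters or {})  # cluster index -> data owned by this blob
        self.read_only = read_only
        self.back = back  # None means "no snapshot behind this blob" in this model


def snapshot(blob, snap_name):
    """Create a snapshot of blob, as in steps 2 and 3 of the table above."""
    # The new read-only blob takes ownership of every cluster the blob owned
    # and inherits the blob's previous back device.
    snap = Blob(snap_name, clusters=blob.clusters, read_only=True, back=blob.back)
    # The original blob owns nothing afterwards and becomes a clone of the snapshot.
    blob.clusters = {}
    blob.back = snap
    return snap


def chain(blob):
    """Render the back-device chain in the notation used by the table."""
    parts = []
    while blob is not None:
        parts.append(f"{blob.name} ({'ro' if blob.read_only else 'rw'})")
        blob = blob.back
    return " --> ".join(parts)
```

Snapshotting `blob1` twice, with a write in between, leaves the clusters written in step 2a owned
by `blob3` and the original clusters owned by `blob2`, one hop further down the chain.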

It is important to understand the chain above when considering strategies to use a golden image from
which many clones are made. The IO path is more efficient if one snapshot is cloned many times than
if a new snapshot is created for every clone. The following illustrates the difference.

Using a single snapshot means the data originally referenced by the golden image is always one hop
away.

```text
create golden                           golden --> golden-snap
snapshot golden as golden-snap                    ^ ^ ^
clone golden-snap as clone1             clone1 ---+ | |
clone golden-snap as clone2             clone2 -----+ |
clone golden-snap as clone3             clone3 -------+
```

Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), then
from clone2's back device (snap2), and finally from clone1's back device (snap1, the current owner of
the blocks originally allocated to golden).

```text
create golden
snapshot golden as snap1                golden --> snap3 -----> snap2 ----> snap1
clone snap1 as clone1                   clone3----/   clone2 --/  clone1 --/
snapshot golden as snap2
clone snap2 as clone2
snapshot golden as snap3
clone snap3 as clone3
```
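
To make the difference concrete, the two layouts can be written as maps from each blob to its back
device, and the number of hops to the blob that owns the original golden data can be counted. The
dictionaries mirror the two diagrams; `hops_to_owner` is a hypothetical helper written for this
illustration.

```python
def hops_to_owner(back_of, start, owner):
    """Count back-device hops from start until owner is reached."""
    hops = 0
    node = start
    while node != owner:
        if node not in back_of:
            raise ValueError(f"{owner} is not reachable from {start}")
        node = back_of[node]
        hops += 1
    return hops


# One snapshot cloned many times.
single_snapshot = {
    "golden": "golden-snap",
    "clone1": "golden-snap",
    "clone2": "golden-snap",
    "clone3": "golden-snap",
}

# A new snapshot of golden for every clone.
snapshot_per_clone = {
    "golden": "snap3",
    "clone3": "snap3",
    "snap3": "snap2",
    "clone2": "snap2",
    "snap2": "snap1",
    "clone1": "snap1",
}
```

In the first layout every clone is one hop from `golden-snap`. In the second, `clone3` and `golden`
are three hops from `snap1`, and the distance grows by one with each additional snapshot and clone
pair.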

A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
freed, depending on whether the clone already owns a cluster for a particular block range.
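
The transfer-or-free rule can be sketched as a merge of two cluster maps. `delete_snapshot` and the
cluster identifiers in the usage note are hypothetical and exist only for this illustration.

```python
def delete_snapshot(snapshot_clusters, clone_clusters):
    """Model deleting a snapshot that has exactly one clone.

    Both arguments map a cluster index within the blob to a cluster identifier.
    Returns (clusters owned by the clone afterwards, identifiers of freed clusters).
    """
    merged = dict(clone_clusters)
    freed = []
    for idx, cluster in sorted(snapshot_clusters.items()):
        if idx in merged:
            # The clone already owns a cluster for this range, so the
            # snapshot's copy is no longer needed.
            freed.append(cluster)
        else:
            # Ownership moves from the snapshot to the clone.
            merged[idx] = cluster
    return merged, freed
```

With a snapshot owning `{0: "c10", 1: "c11", 2: "c12"}` and a clone owning `{1: "c20"}`, the clone
ends up owning `c10`, `c20`, and `c12`, and `c11` is freed.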

Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
can serve as the snapshot for future clones.

#### Inflating and Decoupling Clones

A clone can remove its dependence on a snapshot with any of the following operations:

1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
   allocated clusters. The blob becomes a thick provisioned blob.
2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
   allocated clusters. If the clone's back device snapshot was itself a clone of another
   snapshot, the clone remains a clone but is now a clone of a different snapshot.
3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
   clone remains a clone, but is now a clone of a different snapshot.
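
The difference between inflating and decoupling can be sketched on the same kind of toy cluster
maps. Here `own` is the clone's cluster map and `chain` is a list of snapshot cluster maps, nearest
snapshot first; `inflate`, `decouple`, and `ZERO` are hypothetical names, and copying a cluster is
modeled as copying a dictionary value.

```python
ZERO = "zeroes"  # placeholder for a cluster filled from the zeroes device


def inflate(own, chain, num_clusters):
    """Copy every missing cluster from the chain, or fill it with zeroes.

    Returns (new cluster map, remaining chain). The chain is empty afterwards
    because the blob no longer depends on any snapshot.
    """
    result = dict(own)
    for idx in range(num_clusters):
        if idx in result:
            continue
        for snap in chain:  # nearest snapshot first
            if idx in snap:
                result[idx] = snap[idx]
                break
        else:
            result[idx] = ZERO
    return result, []


def decouple(own, chain):
    """Copy only the clusters backed by the first snapshot in the chain.

    Returns (new cluster map, remaining chain). If the first snapshot was itself
    a clone, the rest of the chain stays in place behind the blob.
    """
    if not chain:
        return dict(own), []
    result = dict(chain[0])
    result.update(own)  # clusters the clone already owns take precedence
    return result, chain[1:]
```

In this model an inflated blob owns all `num_clusters` clusters, whereas a decoupled blob may still
have unallocated ranges that are served by the remaining chain.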

#### Copy-on-write {#blob_pg_copy_on_write}

A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
size. A typical copy-on-write involves the following steps:

1. Allocate a cluster.
2. Allocate a cluster-sized buffer into which data can be read.
3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
4. Write from the cluster-sized buffer into the newly allocated cluster.
5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
   involves at least one page-sized write.
6. Write the new data to the just allocated and copied cluster.

If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
offloaded to the device.
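
The six steps, and the shortcut for a zeroes back device, can be traced with a toy in-memory model.
`copy_on_write`, `read_back_cluster`, and `CLUSTER_SIZE` are assumptions made for illustration; the
device-offloaded copy is not modeled, and the function returns the step numbers it performed so the
two paths can be compared.

```python
CLUSTER_SIZE = 8  # bytes; deliberately tiny (hypothetical value)


def copy_on_write(clusters, read_back_cluster, idx, offset, data, back_is_zeroes=False):
    """Write data at offset within cluster idx, which the blob does not own yet.

    clusters          -- the blob's cluster map, standing in for its metadata
    read_back_cluster -- callable returning the back device's bytes for a cluster index
    Returns the list of step numbers that were performed.
    """
    if offset < 0 or offset + len(data) > CLUSTER_SIZE:
        raise ValueError("write must fit inside one cluster")
    steps = []

    new_cluster = bytearray(CLUSTER_SIZE)          # 1. allocate a cluster
    steps.append(1)

    if not back_is_zeroes:
        buf = bytearray(CLUSTER_SIZE)              # 2. allocate a cluster-sized buffer
        steps.append(2)
        source = read_back_cluster(idx)            # 3. full-cluster read from the back device
        if len(source) != CLUSTER_SIZE:
            raise ValueError("back device returned a partial cluster")
        buf[:] = source
        steps.append(3)
        new_cluster[:] = buf                       # 4. write the buffer into the new cluster
        steps.append(4)

    clusters[idx] = new_cluster                    # 5. record ownership in the metadata
    steps.append(5)
    new_cluster[offset:offset + len(data)] = data  # 6. write the new data
    steps.append(6)
    return steps
```

A two-byte write therefore still moves a full cluster through the buffer unless the back device is
the zeroes device, which is why the cost scales with the cluster size rather than the write size.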

### Sequences and Batches

Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either