doc/blob: thin provisioning, snapshots, and clones
Describe how blobstores handle thin provisioning, snapshots, and clones.

Signed-off-by: Mike Gerdts <mgerdts@nvidia.com>
Change-Id: Ie6f1b69799a404a373269986fe8a89c36f381620
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11270
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This commit is contained in:
parent
fa272c9bc6
commit
45e0a2a370
127
doc/blob.md
@@ -302,6 +302,133 @@ when creating a blob.
Extents pointing to contiguous LBA are run-length encoded, including unallocated extents represented by 0.
Every new cluster allocation incurs serializing whole linked list of pages for the blob.

### Thin Blobs, Snapshots, and Clones

Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
allocated to a blob it is considered owned by that blob and that particular blob's metadata
maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
during snapshot operations described later in @ref blob_pg_snapshots.

Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends
on whether the operation targets blocks that are backed by a cluster owned by the blob or not.

* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
  appropriate cluster.
* **read from other blocks**: The read is passed on to the blob's *back device* and the back
  device services the read. The back device may be another blob or it may be a zeroes device.
* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
  appropriate cluster.
* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
  is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
  for details.

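The five cases above can be summarized as a small decision function. The following is an
illustrative Python model only, not blobstore code: the `Blob` class, the `ZEROES` sentinel, and
`classify_io` are hypothetical names introduced for this sketch.

```python
ZEROES = object()  # hypothetical sentinel standing in for the zeroes device


class Blob:
    """Toy blob: a set of owned cluster indexes plus a back device."""

    def __init__(self, name, owned=(), back=ZEROES):
        self.name = name
        self.owned = set(owned)  # cluster indexes owned by this blob
        self.back = back         # another Blob, or ZEROES


def classify_io(blob, op, cluster):
    """Return which of the cases listed above applies to an IO."""
    if cluster in blob.owned:
        return f"{op}: direct to owned cluster"
    if op == "read":
        return "read: forwarded to back device"
    if blob.back is ZEROES:
        return "write: thin provisioning allocation"
    return "write: copy-on-write"


snap = Blob("snap", owned={0, 1})
clone = Blob("clone", owned={1}, back=snap)
thin = Blob("thin")

print(classify_io(clone, "read", 1))
print(classify_io(clone, "write", 0))
print(classify_io(thin, "write", 0))
```

The model tracks ownership per cluster only; how individual blocks map onto clusters is left out.
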
#### Thin Provisioning {#blob_pg_thin_provisioning}

As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
zeroes.

When a write to a thin provisioned blob targets a block that does not have an allocated cluster, the
following steps are performed:

1. Allocate a cluster.
2. Update blob metadata.
3. Perform the write.

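The allocate-on-first-write behavior can be sketched with a toy model. This is a hedged
illustration, not the blobstore implementation: `ThinBlob`, `CLUSTER_SIZE`, and the
`metadata_updates` counter are hypothetical, and the model assumes a freshly allocated cluster
reads as zeroes until written.

```python
CLUSTER_SIZE = 4  # blocks per cluster; a tiny value chosen for illustration


class ThinBlob:
    """Toy thin provisioned blob whose back device is a zeroes device."""

    def __init__(self, num_clusters):
        self.cluster_map = [None] * num_clusters  # None means "not allocated"
        self.metadata_updates = 0

    def read(self, block):
        cluster = self.cluster_map[block // CLUSTER_SIZE]
        if cluster is None:
            return 0  # serviced by the zeroes device
        return cluster[block % CLUSTER_SIZE]

    def write(self, block, value):
        idx = block // CLUSTER_SIZE
        if self.cluster_map[idx] is None:
            new_cluster = [0] * CLUSTER_SIZE     # 1. allocate a cluster
            self.cluster_map[idx] = new_cluster  # 2. update blob metadata
            self.metadata_updates += 1
        self.cluster_map[idx][block % CLUSTER_SIZE] = value  # 3. perform the write


blob = ThinBlob(num_clusters=2)
blob.write(5, 7)
print(blob.read(5), blob.read(4), blob.cluster_map[0])
```

Only the first write to a cluster pays for allocation and a metadata update; later writes to the
same cluster go straight to it.
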
#### Snapshots and Clones {#blob_pg_snapshots}

A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
the covers this results in a chain of snapshots that are clones of the previous snapshot.

When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
blob. That is:

| Step | Action                         | State                                             |
| ---- | ------------------------------ | ------------------------------------------------- |
| 1    | Create blob1                   | `blob1 (rw)`                                      |
| 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`                       |
| 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`                       |
| 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) --> blob2 (ro)`        |

Supposing blob1 was not thin provisioned, step 1 would have allocated the clusters needed to perform
a full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
becomes blob3.

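The ownership transfer in steps 1 through 3 can be replayed with a small model. The sketch below is
illustrative Python, not blobstore code: `Blob`, `snapshot`, and `chain` are hypothetical names, a
back device of `None` stands for the zeroes device, and writes replace whole clusters so that
copy-on-write details can be ignored.

```python
class Blob:
    """Toy blob: owned clusters plus a back device (None = zeroes device)."""

    def __init__(self, name, clusters=None, back=None, read_only=False):
        self.name = name
        self.clusters = dict(clusters or {})  # cluster index -> data owned by this blob
        self.back = back
        self.read_only = read_only

    def write(self, idx, data):
        assert not self.read_only, "snapshots are read-only"
        self.clusters[idx] = data  # simplification: a whole-cluster write

    def read(self, idx):
        blob = self
        while blob is not None:  # walk the chain of back devices
            if idx in blob.clusters:
                return blob.clusters[idx]
            blob = blob.back
        return 0  # reached the zeroes device


def snapshot(blob, name):
    """The snapshot takes over the blob's clusters and back device."""
    snap = Blob(name, clusters=blob.clusters, back=blob.back, read_only=True)
    blob.clusters = {}
    blob.back = snap
    return snap


def chain(blob):
    names = []
    while blob is not None:
        names.append(blob.name)
        blob = blob.back
    return " --> ".join(names)


blob1 = Blob("blob1", {0: "a", 1: "b"})  # step 1
blob2 = snapshot(blob1, "blob2")         # step 2
blob1.write(1, "B")                      # step 2a
blob3 = snapshot(blob1, "blob3")         # step 3
print(chain(blob1))
print(blob3.clusters, blob2.clusters)
```

After step 3 in this model, blob1 owns nothing, blob3 owns the cluster written in step 2a, and blob2
still owns the clusters originally allocated to blob1.
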
It is important to understand the chain above when considering strategies to use a golden image from
which many clones are made. The IO path is more efficient when one snapshot is cloned many times than
when a new snapshot is created for every clone. The following illustrates the difference.

Using a single snapshot means the data originally referenced by the golden image is always one hop
away.

```text
create golden                           golden --> golden-snap
snapshot golden as golden-snap                    ^ ^ ^
clone golden-snap as clone1             clone1 ---+ | |
clone golden-snap as clone2             clone2 -----+ |
clone golden-snap as clone3             clone3 -------+
```

Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from
clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the
blocks originally allocated to golden).

```text
create golden
snapshot golden as snap1          golden --> snap3 -----> snap2 ----> snap1
clone snap1 as clone1             clone3----/   clone2 --/  clone1 --/
snapshot golden as snap2
clone snap2 as clone2
snapshot golden as snap3
clone snap3 as clone3
```

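The difference between the two strategies can be counted directly. The following is a minimal
Python model under stated assumptions: `Blob`, `snapshot`, `clone`, and `hops_to` are hypothetical
helpers, and only the back-device links are modeled, not clusters.

```python
class Blob:
    def __init__(self, name, back=None):
        self.name = name
        self.back = back  # None stands for the zeroes device


def snapshot(blob, name):
    """The new snapshot is inserted between the blob and its old back device."""
    snap = Blob(name, back=blob.back)
    blob.back = snap
    return snap


def clone(snap, name):
    return Blob(name, back=snap)


def hops_to(blob, target):
    """Count back-device hops from blob to target."""
    hops = 0
    while blob is not target:
        if blob is None:
            raise ValueError("target is not in the chain")
        blob = blob.back
        hops += 1
    return hops


def single_snapshot(n):
    golden = Blob("golden")
    snap = snapshot(golden, "golden-snap")
    return [hops_to(clone(snap, f"clone{i}"), snap) for i in range(1, n + 1)]


def snapshot_per_clone(n):
    golden = Blob("golden")
    first_snap, clones = None, []
    for i in range(1, n + 1):
        snap = snapshot(golden, f"snap{i}")
        if first_snap is None:
            first_snap = snap  # owner of the original golden data
        clones.append(clone(snap, f"clone{i}"))
    return [hops_to(c, first_snap) for c in clones]


print(single_snapshot(3))
print(snapshot_per_clone(3))
```

In this model every clone of a single snapshot is one hop from the original data, while with a
snapshot per clone the distance grows by one hop with each new pair.
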
A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
freed, depending on whether the clone already owns a cluster for a particular block range.

Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
can serve as the snapshot for future clones.

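The transfer-or-free rule for snapshot deletion can be sketched as follows. This is an illustrative
Python model, not the blobstore algorithm: `Blob` and `delete_snapshot` are hypothetical names, and
freeing every cluster when the snapshot has no clone is an assumption of the sketch.

```python
class Blob:
    def __init__(self, name, clusters=None, back=None):
        self.name = name
        self.clusters = dict(clusters or {})  # cluster index -> data owned by this blob
        self.back = back                      # None stands for the zeroes device


def delete_snapshot(snap, clones):
    """Delete snap and return the cluster indexes that were freed."""
    if len(clones) > 1:
        raise ValueError("a snapshot with more than one clone cannot be deleted")
    if not clones:
        freed = sorted(snap.clusters)  # assumption: nothing left to inherit the clusters
        snap.clusters = {}
        return freed
    clone = clones[0]
    freed = []
    for idx, data in snap.clusters.items():
        if idx in clone.clusters:
            freed.append(idx)           # clone already owns this range: free the old copy
        else:
            clone.clusters[idx] = data  # otherwise ownership moves to the clone
    clone.back = snap.back              # the clone no longer depends on the deleted snapshot
    snap.clusters = {}
    return sorted(freed)


snap = Blob("snap", {0: "a", 1: "b", 2: "c"})
clone = Blob("clone", {1: "B"}, back=snap)
print(delete_snapshot(snap, [clone]), clone.clusters)
```

No data is copied in this model; clusters either change owner or are released.
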
#### Inflating and Decoupling Clones

A clone can remove its dependence on a snapshot with any one of the following operations:

1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
   allocated clusters. The blob becomes a thick provisioned blob.
2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
   allocated clusters. If the clone's back device snapshot was itself a clone of another
   snapshot, the clone remains a clone but is now a clone of a different snapshot.
3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
   clone remains a clone, but is now a clone of a different snapshot.

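The difference between inflating and decoupling can be made concrete with a toy model. The code
below is a hedged Python sketch, not the blobstore implementation: `Blob`, `read_through`,
`inflate`, and `decouple` are hypothetical names, and a back device of `None` stands for the zeroes
device.

```python
class Blob:
    def __init__(self, name, clusters=None, back=None):
        self.name = name
        self.clusters = dict(clusters or {})  # cluster index -> data owned by this blob
        self.back = back


def read_through(blob, idx):
    """Resolve a cluster by walking the chain of back devices."""
    while blob is not None:
        if idx in blob.clusters:
            return blob.clusters[idx]
        blob = blob.back
    return 0  # fell through to the zeroes device


def inflate(blob, num_clusters):
    """Copy every cluster the blob does not own, whatever backs it."""
    for idx in range(num_clusters):
        if idx not in blob.clusters:
            blob.clusters[idx] = read_through(blob.back, idx)
    blob.back = None  # fully allocated: nothing is read from a back device any more


def decouple(blob):
    """Copy only the clusters owned by the immediate back device."""
    parent = blob.back
    if parent is None:
        return
    for idx, data in parent.clusters.items():
        blob.clusters.setdefault(idx, data)  # keep clusters the blob already owns
    blob.back = parent.back  # now a clone of the parent's parent, if there is one


base = Blob("base", {0: "a", 1: "b", 2: "c"})
mid = Blob("mid", {1: "B"}, back=base)
clone1 = Blob("clone1", {2: "X"}, back=mid)
clone2 = Blob("clone2", {2: "X"}, back=mid)

decouple(clone1)
inflate(clone2, num_clusters=4)
print(clone1.clusters, clone1.back.name)
print(clone2.clusters, clone2.back)
```

In this model decoupling copies one cluster and leaves `clone1` as a clone of `base`, while
inflating allocates all four clusters of `clone2`, including one that was backed by zeroes.
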
#### Copy-on-write {#blob_pg_copy_on_write}

A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
size. A typical copy-on-write involves the following steps:

1. Allocate a cluster.
2. Allocate a cluster-sized buffer into which data can be read.
3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
4. Write from the cluster-sized buffer into the newly allocated cluster.
5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
   involves at least one page-sized write.
6. Write the new data to the newly allocated and copied cluster.

If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
offloaded to the device.

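The six steps, and the shortcut for zeroes-backed clusters, can be traced in a toy model. This is an
illustrative Python sketch rather than blobstore code: `Blob`, `cow_write`, `CLUSTER_SIZE`, and the
`metadata_writes` counter are hypothetical, and device-side copy offload is not modeled.

```python
CLUSTER_SIZE = 4  # blocks per cluster; a tiny value chosen for illustration


class Blob:
    def __init__(self, clusters=None, back=None):
        self.clusters = dict(clusters or {})  # cluster index -> list of block values
        self.back = back                      # None stands for the zeroes device
        self.metadata_writes = 0

    def read_cluster(self, idx):
        """Return a copy of the cluster's blocks, or None if zeroes-backed."""
        blob = self
        while blob is not None:
            if idx in blob.clusters:
                return list(blob.clusters[idx])
            blob = blob.back
        return None


def cow_write(blob, block, value):
    """Write one block; return how many blocks had to be copied first."""
    idx, offset = divmod(block, CLUSTER_SIZE)
    copied = 0
    if idx not in blob.clusters:
        new_cluster = [0] * CLUSTER_SIZE      # 1. allocate a cluster
        source = blob.back.read_cluster(idx) if blob.back is not None else None
        if source is not None:                # skipped when backed by zeroes
            buffer = source                   # 2-3. buffer filled by a full-cluster read
            new_cluster[:] = buffer           # 4. write the buffer to the new cluster
            copied = CLUSTER_SIZE
        blob.clusters[idx] = new_cluster      # 5. record ownership in metadata
        blob.metadata_writes += 1
    blob.clusters[idx][offset] = value        # 6. write the new data
    return copied


snap = Blob({0: [1, 2, 3, 4]})
clone = Blob(back=snap)
print(cow_write(clone, 2, 9), clone.clusters[0], snap.clusters[0])
print(cow_write(clone, 5, 7), clone.clusters[1])
```

The copied-block count equals the cluster size whenever a real source cluster exists, which is the
sense in which the cost is proportional to the cluster size.
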
### Sequences and Batches

Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either