longhorn/enhancements/20220110-extend-csi-snapshot-to-support-longhorn-snapshot.md
David Ko 4f35fda4b2 docs: fix typo
Signed-off-by: David Ko <dko@suse.com>
2022-12-12 14:20:19 +08:00

291 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Title
Extend CSI snapshot to support Longhorn snapshot
## Summary
Before this feature, if the user uses [the CSI Snapshotter mechanism](https://kubernetes-csi.github.io/docs/snapshot-restore-feature.html),
they can only create Longhorn backups (out of cluster). We want to extend the CSI Snapshotter to support creating for
Longhorn snapshot (in-cluster) as well.
### Related Issues
https://github.com/longhorn/longhorn/issues/2534
## Motivation
### Goals
Extend the CSI Snapshotter to support:
* Creating Longhorn snapshot
* Deleting Longhorn snapshot
* Creating a new PVC from a CSI snapshot that is associated with a Longhorn snapshot
### Non-goals
* Longhorn snapshot Reverting is not a goal because CSI snapshotter doesn't support replace in place for now:
https://github.com/container-storage-interface/spec/blob/master/spec.md#createsnapshot
## Proposal
### User Stories
Before this feature is implemented, users can only use CSI Snapshotter to create/restore Longhorn backups.
This means that users must set up a backup target outside of the cluster. Uploading/downloading data from
backup target is a long/costly operation. Sometimes, users might just want to use CSI Snapshotter to take
an in-cluster Longhorn snapshot and create a new volume from that snapshot. The Longhorn snapshot operation
is cheap and faster than the backup operation and doesn't require setting up a backup target.
### User Experience In Detail
To use this feature, users need to do:
1. Deploy the CSI snapshot CRDs, Controller as instructed at https://longhorn.io/docs/1.2.3/snapshots-and-backups/csi-snapshot-support/enable-csi-snapshot-support/
1. Deploy a VolumeSnapshotClass with the parameter `type: longhorn-snapshot`. I.e.,
```yaml
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
name: longhorn-snapshot
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
type: longhorn-snapshot
```
1. To create a new CSI snapshot associated with a Longhorn snapshot of the volume `test-vol`, users deploy the following VolumeSnapshot CR:
```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
name: test-snapshot
spec:
volumeSnapshotClassName: longhorn-snapshot
source:
persistentVolumeClaimName: test-vol
```
A new Longhorn snapshot is created for the volume `test-vol`
1. To create a new PVC from the CSI snapshot, users can deploy the following yaml:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-restore-snapshot-pvc
spec:
storageClassName: longhorn
dataSource:
name: test-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi # should be the same as the size of `test-vol`
```
A new PVC will be created with the same content as in the VolumeSnapshot `test-snapshot`
1. Deleting the VolumeSnapshot `test-snapshot` will lead to the deletion of the corresponding Longhorn snapshot of the volume `test-vol`
### API changes
None
## Design
### Implementation Overview
We follow the specification in [the CSI spec](https://github.com/container-storage-interface/spec/blob/master/spec.md#createsnapshot) when supporting the CSI snapshot.
We define a new parameter in the VolumeSnapshotClass `type`.
The value of the parameter `type` can be `longhorn-snapshot` or `longhorn-backup`.
When `type` is `longhorn-snapshot` it means that the CSI VolumeSnapshot created with this VolumeSnapshotClass is associated with a Longhorn snapshot.
When `type` is `longhorn-backup` it means that the CSI VolumeSnapshot created with this VolumeSnapshotClass is associated with a Longhorn backup.
In [CreateSnapshot function](https://github.com/longhorn/longhorn-manager/blob/878cfb868c568396d6ebfa4ce096c5d95d9b31e3/csi/controller_server.go#L539), we get the
value of parameter `type`. If it is `longhorn-backup`, we take a Longhorn backup as before. If it is `longhorn-snapshot` we do:
* Get the name of the Longhorn volume
* Check if the volume is in attached state.
If it is not, return `codes.FailedPrecondition`.
We cannot take a snapshot of non-attached volume.
* Check if a Longhorn snapshot with the same name as the requested CSI snapshot already exists.
If yes, return OK without taking a new Longhorn snapshot.
* Take a new Longhorn snapshot. Encode the snapshotId in the format `snap://volume-name/snapshot-name`.
This snaphotId will be used in the later CSI CreateVolume and DeleteSnapshot call.
In [CreateVolume function](https://github.com/longhorn/longhorn-manager/blob/878cfb868c568396d6ebfa4ce096c5d95d9b31e3/csi/controller_server.go#L63):
* If the VolumeContentSource is a `VolumeContentSource_Snapshot` type, decode the snapshotId in the format from the above step.
* Create a new volume with the `dataSource` set to `snap://volume-name/snapshot-name`. This will trigger Longhorn to clone the content of the snapshot to the new volume.
Note that if the source volume is not attached, Longhorn cannot verify the existence of the snapshot inside the Longhorn volume.
This means that [the API will return error](https://github.com/longhorn/longhorn-manager/blob/878cfb868c568396d6ebfa4ce096c5d95d9b31e3/manager/volume.go#L347-L352) and new PVC cannot be provisioned.
In [DeleteSnapshot function](https://github.com/longhorn/longhorn-manager/blob/878cfb868c568396d6ebfa4ce096c5d95d9b31e3/csi/controller_server.go#L675):
* Decode the snapshotId in the format from the above step.
If the type is `longhorn-backup` we delete the backup as before.
If the type is `longhorn-snapshot`, we delete the corresponding Longhorn snapshot of the source volume.
If the source volume or the snapshot is no longer exist, we return OK as specified in [the CSI spec](https://github.com/container-storage-interface/spec/blob/master/spec.md#deletesnapshot)
### Test plan
Integration test plan.
1. Deploy the CSI snapshot CRDs, Controller as instructed at https://longhorn.io/docs/1.2.3/snapshots-and-backups/csi-snapshot-support/enable-csi-snapshot-support/
1. Deploy 4 VolumeSnapshotClass:
```yaml
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
name: longhorn-backup-1
driver: driver.longhorn.io
deletionPolicy: Delete
```
```yaml
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
name: longhorn-backup-2
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
type: longhorn-backup
```
```yaml
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
name: longhorn-snapshot
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
type: longhorn-snapshot
```
```yaml
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
name: invalid-class
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
type: invalid
```
1. Create Longhorn volume `test-vol` of 5GB. Create PV/PVC for the Longhorn volume.
1. Create a workload that uses the volume. Write some data to the volume.
Make sure data persist to the volume by running `sync`
1. Set up a backup target for Longhorn
#### Scenarios 1: CreateSnapshot
* `type` is `longhorn-backup` or `""`
* Create a VolumeSnapshot with the following yaml
```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
name: test-snapshot-longhorn-backup
spec:
volumeSnapshotClassName: longhorn-backup-1
source:
persistentVolumeClaimName: test-vol
```
* Verify that a backup is created.
* Delete the `test-snapshot-longhorn-backup`
* Verify that the backup is deleted
* Create the `test-snapshot-longhorn-backup` VolumeSnapshot with `volumeSnapshotClassName: longhorn-backup-2`
* Verify that a backup is created.
* `type` is `longhorn-snapshot`
* volume is in detached state.
* Scale down the workload of `test-vol` to detach the volume.
* Create `test-snapshot-longhorn-snapshot` VolumeSnapshot with `volumeSnapshotClassName: longhorn-snapshot`.
* Verify the error `volume ... invalid state ... for taking snapshot` in the Longhorn CSI plugin.
* volume is in attached state.
* Scale up the workload to attach `test-vol`
* Verify that a Longhorn snapshot is created for the `test-vol`.
* invalid type
* Create `test-snapshot-invalid` VolumeSnapshot with `volumeSnapshotClassName: invalid-class`.
* Verify the error `invalid snapshot type: %v. Must be %v or %v or` in the Longhorn CSI plugin.
* Delete `test-snapshot-invalid` VolumeSnapshot.
#### Scenarios 2: Create new volume from CSI snapshot
* From `longhorn-backup` type
* Create a new PVC with the flowing yaml:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-restore-pvc
spec:
storageClassName: longhorn
dataSource:
name: test-snapshot-longhorn-backup
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
```
* Attach the PVC `test-restore-pvc` and verify the data
* Delete the PVC
* From `longhorn-snapshot` type
* Source volume is attached && Longhorn snapshot exist
* Create a PVC with the following yaml:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-restore-pvc
spec:
storageClassName: longhorn
dataSource:
name: test-snapshot-longhorn-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
```
* Attach the PVC `test-restore-pvc` and verify the data
* Delete the PVC
* Source volume is detached
* Scale down the workload to detach the `test-vol`
* Create the same PVC `test-restore-pvc` as in the `Source volume is attached && Longhorn snapshot exist` section
* Verify that PVC provisioning failed because the source volume is detached so Longhorn cannot verify the existence of the Longhorn snapshot in the source volume.
* Scale up the workload to attach `test-vol`
* Wait for PVC to finish provisioning and be bounded
* Attach the PVC `test-restore-pvc` and verify the data
* Delete the PVC
* Source volume is attached && Longhorn snapshot doesnt exist
* Find the VolumeSnapshotContent of the VolumeSnapshot `test-snapshot-longhorn-snapshot`.
Find the Longhorn snapshot name inside the field `VolumeSnapshotContent.snapshotHandle`.
Go to Longhorn UI. Delete the Longhorn snapshot.
* Repeat steps in the section `Longhorn snapshot exist` above.
PVC should be stuck in provisioning because Longhorn snapshot of the source volume doesn't exist.
* Delete the PVC `test-restore-pvc` PVC
#### Scenarios 3: Delete CSI snapshot
* `longhorn-backup` type
* Done in the above step
* `longhorn-snapshot` type
* volume is attached && snapshot doesnt exist
* Delete the VolumeSnapshot `test-snapshot-longhorn-snapshot` and verify that the VolumeSnapshot is deleted.
* volume is attached && snapshot exist
* Recreate the VolumeSnapshot `test-snapshot-longhorn-snapshot`
* Verify the creation of Longhorn snapshot with the name in the field `VolumeSnapshotContent.snapshotHandle`
* Delete the VolumeSnapshot `test-snapshot-longhorn-snapshot`
* Verify that Longhorn snapshot is removed or marked as removed
* Verify that the VolumeSnapshot `test-snapshot-longhorn-snapshot` is deleted.
* volume is detached
* Recreate the VolumeSnapshot `test-snapshot-longhorn-snapshot`
* Scale down the workload to detach `test-vol`
* Delete the VolumeSnapshot `test-snapshot-longhorn-snapshot`
* Verify that VolumeSnapshot `test-snapshot-longhorn-snapshot` is stuck in deleting
### Upgrade strategy
No upgrade strategy needed
## Note [optional]
We need to update the docs and examples to reflect the new parameter in the VolumeSnapshotClass, `type`.