- [FEATURE] Restore BackingImage for BackupVolume in a new cluster [#4165](https://github.com/longhorn/longhorn/issues/4165)
## Motivation
### Goals
- When a Volume with a BackingImage is backed up, the BackingImage is also backed up.
- Users can manually back up a BackingImage.
- When a Volume with a BackingImage is restored, the BackingImage is also restored.
- Users can manually restore a BackingImage.
- All BackingImages are backed up in blocks.
- If a block contains the same data as a block already in the backup store, BackingImages reuse the existing block instead of uploading another identical one.
## Proposal
### User Stories
With this feature, users no longer need to manually handle BackingImages across clusters when backing up and restoring Volumes that use BackingImages.
### User Experience In Detail
Before this feature:
A BackingImage is not backed up automatically when a Volume that uses it is backed up, so the user needs to prepare the BackingImage again in the other cluster before restoring the Volume.
After this feature:
A BackingImage is backed up automatically when a Volume that uses it is backed up. The user can also manually back up a BackingImage independently.
Then, when the Volume with the BackingImage is restored from the backup store, Longhorn automatically restores the BackingImage at the same time. The user can also manually restore the BackingImage independently.
This improves the user experience and reduces the operational overhead.
## Design
### Implementation Overview
#### Backup BackingImage - BackupStore
- Backing up a `BackingImage` differs from backing up a `Volume`, which consists of a series of `Snapshots`. A `BackingImage` already contains all the blocks we need to back up, so we don't need to compute the delta between two `BackingImages` as we do for `Snapshots`, where the delta might be spread across other `Snapshots` between the current `Snapshot` and the last backed-up `Snapshot`.
- All `BackingImages` share the same block pool in the backup store, so blocks can be reused to speed up backups and save space. This can happen when a user creates a v1 `BackingImage`, uses the image to add more data, and then exports a v2 `BackingImage`.
- For restoration, we still restore the full image onto one of the ready disks.
- Unlike a `Volume` backup, a `BackingImage` has no size constraint: it can be smaller than 2MB or not a multiple of 2MB, so the last block might be smaller than 2MB.
- When backing up a `BackingImage`
    1. `preload()`: preload the `BackingImage` to find all the sectors that contain data.
    2. `createBackupBackingMapping()`: get all the blocks we need to back up.
        - Block: offset + size (2MB for each block; the last block might be smaller than 2MB)
    3. `backupMappings()`: write the blocks to the backup store.
        - If a block is already in the backup store, skip it.
    4. `saveBackupBacking()`: save the metadata of the `BackupBackingImage`, including the block mapping, to the backup store. The mapping needs to include the block size.
- When restoring a `BackingImage`
    - `loadBackupBacking()`: load the metadata of the `BackupBackingImage` from the backup store
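The block mapping and deduplication steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Longhorn implementation: `blockMapping`, `blockStore`, `backupBlocks`, and `restoreBlocks` are hypothetical names, the backup store is replaced by an in-memory map, SHA-256 is assumed as the block key, and the sketch does not skip empty sectors the way `preload()` is described to.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blockSize mirrors the 2MB block size mentioned in the design.
const blockSize = 2 * 1024 * 1024

// blockMapping is a hypothetical record of one block: offset + size + key.
type blockMapping struct {
	Offset   int64
	Size     int64
	Checksum string
}

// blockStore is a hypothetical in-memory stand-in for the shared block pool.
type blockStore map[string][]byte

// backupBlocks splits data into blocks of at most blockSize bytes, uploads
// only blocks whose checksum is not yet in the store, and returns the mapping
// together with the number of blocks actually uploaded.
func backupBlocks(data []byte, store blockStore) ([]blockMapping, int) {
	var mappings []blockMapping
	uploaded := 0
	for offset := 0; offset < len(data); offset += blockSize {
		end := offset + blockSize
		if end > len(data) {
			end = len(data) // the last block may be smaller than blockSize
		}
		block := data[offset:end]
		sum := sha256.Sum256(block)
		key := hex.EncodeToString(sum[:])
		if _, ok := store[key]; !ok {
			store[key] = append([]byte(nil), block...)
			uploaded++
		}
		mappings = append(mappings, blockMapping{
			Offset:   int64(offset),
			Size:     int64(len(block)),
			Checksum: key,
		})
	}
	return mappings, uploaded
}

// restoreBlocks rebuilds the full image from the mapping and the store.
func restoreBlocks(mappings []blockMapping, store blockStore) ([]byte, error) {
	var size int64
	for _, m := range mappings {
		if end := m.Offset + m.Size; end > size {
			size = end
		}
	}
	out := make([]byte, size)
	for _, m := range mappings {
		block, ok := store[m.Checksum]
		if !ok {
			return nil, fmt.Errorf("missing block %s", m.Checksum)
		}
		copy(out[m.Offset:m.Offset+m.Size], block)
	}
	return out, nil
}

func main() {
	store := blockStore{}

	// v1: two full blocks plus a 100-byte tail, each block with distinct content.
	v1 := make([]byte, 2*blockSize+100)
	for i := range v1 {
		v1[i] = byte(i/blockSize + 1)
	}
	m1, up1 := backupBlocks(v1, store)
	fmt.Println("v1 blocks:", len(m1), "uploaded:", up1)

	// v2: v1 with extra data appended, so only the last block differs.
	v2 := append(append([]byte(nil), v1...), make([]byte, 50)...)
	m2, up2 := backupBlocks(v2, store)
	fmt.Println("v2 blocks:", len(m2), "uploaded:", up2)
}
```

In this sketch the second backup uploads only the changed tail block, which is the v1/v2 reuse case described above. Storing the size in each mapping entry is what lets restoration handle a final block shorter than 2MB.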
- Field `Spec.ManagerAddress` indicates the address of the backing-image-manager running the BackingImage backup.
- Field `Status.Checksum` records the checksum of the BackingImage. Users may create a new BackingImage with the same name but different content after deleting an old one, or another cluster may have a BackingImage with the same name. To avoid conflicts, we use the checksum to check whether they are the same.
- If the cluster already has a `BackingImage` with the same name as one in the backup store, we still create the `BackupBackingImage` CR. The user can use the checksum to check whether they are the same. For this reason we don't use the `UUID` across clusters, since the user might already have prepared a BackingImage with the same name and content in another cluster.
2. Add a new controller `BackupBackingImageController`.
- Workflow
- Check and update the ownership.
- Do cleanup if the deletion timestamp is set.
- Clean up the backup `BackingImage` in the backup store
- If `Status.LastSyncedAt.IsZero() && Spec.BackingImageName != ""`, **the CR was created by the User/API layer** and we need to perform the backup
- Start the monitor
- Pick one `BackingImageManager`
- Request `BackingImageManager` to back up the `BackingImage` by calling the `CreateBackup()` gRPC
- Otherwise, the `BackupBackingImage` CR was created by `BackupTargetController`, and the backup `BackingImage` already existed in the remote backup target before the CR was created.
- Use `backupTargetClient` to get the info of the backup `BackingImage`
- Sync the status
3. In `BackingImageManager - manager(backing_image.go)`
- Implement the `CreateBackup()` gRPC
- Back up the `BackingImage` to the backup store in blocks
4. In controller `BackupTargetController`
- Workflow
- Implement `syncBackupBackingImage()` function
- Create the `BackupBackingImage` CRs whose names are in the backup store but not in the cluster
- Delete the `BackupBackingImage` CRs whose names are in the cluster but not in the backup store
- Request `BackupBackingImageController` to reconcile those `BackupBackingImage` CRs
5. Add a backup API for `BackingImage`
- Add new action `backup` to `BackingImage` (`"/v1/backingimages/{name}"`)
- Create a `BackupBackingImage` CR to start the backup process
- If the `BackupBackingImage` already exists, there is already a `BackupBackingImage` in the backup store. The user can check the checksum to verify whether they are the same.
- API Watch: establish a streaming connection to report BackupBackingImage info.
6. Trigger
- Back up through `BackingImage` operation manually
- Back up the `BackingImage` when the user backs up the Volume
- In the `SnapshotBackup()` API
- We get the `BackingImage` of the `Volume`
- Back up the `BackingImage` if the `BackupBackingImage` does not exist
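The controller decision and the backup target sync described above can be sketched as follows. This is a simplified illustration, not the controller code: `backupBackingImageState`, `needsBackup`, and `diffNames` are hypothetical names, the CR is reduced to the two fields used in the check, and plain `time.Time` stands in for the Kubernetes timestamp type.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// backupBackingImageState is a hypothetical, flattened view of the two CR
// fields the controller inspects (Status.LastSyncedAt, Spec.BackingImageName).
type backupBackingImageState struct {
	LastSyncedAt     time.Time
	BackingImageName string
}

// needsBackup reports whether the CR was created by the User/API layer and
// therefore still needs a backup, following the condition in the workflow.
func needsBackup(s backupBackingImageState) bool {
	return s.LastSyncedAt.IsZero() && s.BackingImageName != ""
}

// toSet converts a list of names into a set for membership checks.
func toSet(names []string) map[string]struct{} {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return set
}

// diffNames returns the CR names to create (present in the backup store but
// not in the cluster) and to delete (present in the cluster but not in the
// backup store). Results are sorted for deterministic output.
func diffNames(inStore, inCluster []string) (toCreate, toDelete []string) {
	storeSet, clusterSet := toSet(inStore), toSet(inCluster)
	for n := range storeSet {
		if _, ok := clusterSet[n]; !ok {
			toCreate = append(toCreate, n)
		}
	}
	for n := range clusterSet {
		if _, ok := storeSet[n]; !ok {
			toDelete = append(toDelete, n)
		}
	}
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}

func main() {
	// Created through the API: never synced, and it names a BackingImage.
	fmt.Println(needsBackup(backupBackingImageState{BackingImageName: "parrot"}))

	// Created by the backup target sync: already synced, so only status is pulled.
	fmt.Println(needsBackup(backupBackingImageState{LastSyncedAt: time.Now()}))

	create, del := diffNames(
		[]string{"alpine", "ubuntu"}, // names found in the backup store
		[]string{"alpine", "debian"}, // names of CRs in the cluster
	)
	fmt.Println("create:", create, "delete:", del)
}
```

With these inputs the sketch reports that `ubuntu` needs a CR created and `debian` needs its CR deleted, while `alpine` is left for `BackupBackingImageController` to reconcile.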
#### Restoring BackingImage - Controller
2. Add new data source type `restore` for `BackingImageDataSource`
- In `sync/service` implement `restoreFromBackupURL()` to restore the `BackingImage` from backup store to the local disk.
7. In `BackingImageDataSourceController`
- No need to change, it will take over control when `BackingImageDataSource` status is `ReadyForTransfer`.
- If restoring the `BackingImage` fails, the `BackingImage` status becomes failed, and the `BackingImageDataSourcePod` is cleaned up and retried with a backoff limit, as with `type=download`. The process is the same as for other `BackingImage` creation paths.
8. Trigger
- Restore through `BackingImage` operation manually
- Restore when the user restores a `Volume` with a `BackingImage`
- Restoring a Volume is actually requesting `Create` a Volume with `fromBackup` in the spec
- In `Create()` API we check if the `Volume` has `fromBackup` parameters and has `BackingImage`
- Check if `BackingImage` exists
- Check and restore `BackupBackingImage` if `BackingImage` does not exist
- Restore `BackupBackingImage` by creating `BackingImage` with type `restore` and `backupURL`
- Then create the `Volume` CR, so the admission webhook won't fail because of a missing `BackingImage` ([ref](https://github.com/longhorn/longhorn-manager/blob/master/webhook/resources/volume/validator.go#L86))
- Restore when the user creates a `Volume` through `CSI`
- In `CreateVolume()` we check if the `Volume` has `fromBackup` parameters and has `BackingImage`
- In `checkAndPrepareBackingImage()`, we restore `BackupBackingImage` by creating `BackingImage` with type `restore` and `backupURL`
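The ordering in the restore triggers above (make sure the `BackingImage` exists, restoring it from the backup store if necessary, and only then create the `Volume`) can be sketched as follows. This is an illustration under assumptions, not the Longhorn code: the `cluster` type and its maps are hypothetical in-memory stand-ins for the CRs, the body of `checkAndPrepareBackingImage` here is invented for the sketch, and the backup URL is a placeholder.

```go
package main

import "fmt"

// backingImage is a hypothetical, reduced view of a BackingImage CR.
type backingImage struct {
	Name       string
	SourceType string
	BackupURL  string
}

// cluster is a hypothetical in-memory stand-in for the cluster state.
type cluster struct {
	backingImages map[string]backingImage // existing BackingImages by name
	backupURLs    map[string]string       // BackupBackingImage name -> backup URL
	volumes       map[string]string       // Volume name -> BackingImage name
}

func newCluster() *cluster {
	return &cluster{
		backingImages: map[string]backingImage{},
		backupURLs:    map[string]string{},
		volumes:       map[string]string{},
	}
}

// checkAndPrepareBackingImage ensures the named BackingImage exists. If it is
// missing but a BackupBackingImage is available, it creates the BackingImage
// with source type "restore" and the backup URL.
func (c *cluster) checkAndPrepareBackingImage(name string) error {
	if name == "" {
		return nil // the Volume does not use a BackingImage
	}
	if _, ok := c.backingImages[name]; ok {
		return nil // already present, nothing to restore
	}
	url, ok := c.backupURLs[name]
	if !ok {
		return fmt.Errorf("no BackingImage or BackupBackingImage named %q", name)
	}
	c.backingImages[name] = backingImage{Name: name, SourceType: "restore", BackupURL: url}
	return nil
}

// createVolumeFromBackup prepares the BackingImage first and creates the
// Volume only afterwards, so validation never sees a missing BackingImage.
func (c *cluster) createVolumeFromBackup(volumeName, backingImageName string) error {
	if err := c.checkAndPrepareBackingImage(backingImageName); err != nil {
		return err
	}
	c.volumes[volumeName] = backingImageName
	return nil
}

func main() {
	c := newCluster()
	c.backupURLs["parrot"] = "backup-url-placeholder" // placeholder, not a real URL

	if err := c.createVolumeFromBackup("vol-1", "parrot"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("restored with source type:", c.backingImages["parrot"].SourceType)

	// No BackingImage and no backup exist for this name, so creation is rejected.
	fmt.Println("error:", c.createVolumeFromBackup("vol-2", "missing"))
}
```

Preparing the `BackingImage` before the `Volume` keeps both the API path and the CSI path on the same check, which is why the sketch routes every creation through a single helper.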