Backing Image
Summary
Longhorn can set a backing image for a Longhorn volume, a feature primarily designed for VM usage.
Related Issues
https://github.com/longhorn/longhorn/issues/2006 https://github.com/longhorn/longhorn/issues/2295
Motivation
Goals
- A qcow2 or raw image file can be used as the backing image of a volume.
- The backing image works fine with backup or restore.
- Multiple replicas in the same disk can share one backing image.
- One backing image should be downloaded from the remote source only once, then delivered to other nodes by Longhorn.
Non-goals:
Fixing the issue mentioned in issue #1948 is not the responsibility of this feature.
Proposal
- Launch a new kind of workload, `backing-image-manager`, to handle all backing images for each disk.
  - Supports pulling an image file from remote URLs, or syncing an image file from other managers.
  - Supports reusing existing backing image files. Since those files are considered immutable/read-only once downloaded, backing image managers should be able to directly take over the files if the work directories & meta info match. Notice that the file checksum won't be checked or stored in the 1st version, but it can be introduced later if necessary.
  - Supports live upgrade: Different from instance managers, backing images are just files. We can directly shut down the old backing image manager pods, then let the new backing image manager pods rely on the reuse mechanism to take over the existing files.
  - All image files will be periodically checked by managers.
  - If the disk is not available or gets replaced, the backing image manager cannot do anything except report all backing images as failed.
  - Once there is a modification to an image, managers will notify callers via gRPC streaming.
- For `longhorn-manager`:
  - Similar to engines/replicas vs instance managers, there will be 2 new CRDs, `backingimages.longhorn.io` and `backingimagemanagers.longhorn.io`.
    - Since some disks won't be chosen by replicas using backing images, there should be a disk map in each `backingimages.longhorn.io` CR spec to indicate in which disk/node a backing image should be downloaded.
    - Considering the disk migration and failed replica reuse features, there will be an actual backing image file for each disk rather than for each node.
    - For the CR `backingimages.longhorn.io`, the spec is responsible for recording the URL and the disk map. The status records the file status for each disk as well as some meta info like size & UUID, which is then presented by the UI.
    - For the CR `backingimagemanagers.longhorn.io`, the spec should record the disk info, the backing image list, and the backing image manager pod image. The status will be synced with the reports from the pods, which reflect the actual status of the backing image files.
  - The backing image of a Longhorn volume should be downloaded before starting the related volume replicas. Before sending requests to launch replica processes, the replica controller will check and wait for the backing image to be ready if a backing image is set for the related volume.
    - This is common logic that applies not only to normal volumes but also to restoring/DR volumes.
    - We need to make sure the backing image name as well as the download address are stored in backup volumes, so that users are able to re-specify the backing image when restoring a volume in case the original image becomes invalid.
  - BackingImageController is responsible for:
    - Generating a UUID for each new backing image.
    - Handling the backing image manager lifecycle.
    - Syncing download status/info with the backing image manager status.
    - Setting a timestamp if there is no replica using the backing image in a disk.
  - BackingImageManagerController is responsible for:
    - Creating pods to handle backing image files.
    - Handling files based on the spec:
      - Deleting unused backing images.
      - Downloading backing images: If there is no file among all managers, it should follow a specific logic to pick one available default backing image manager, then send a pull request. Otherwise, the current manager will fetch the file from other managers.
  - There should be a cleanup timeout setting and a related timestamp that indicates when a backing image can be removed from a disk when no replica in the disk is using the backing image.
- For `longhorn-engine`:
  - Most of the backing image related logic is already there.
  - Raw image support will be introduced.
  - Make sure the backing file path is updated each time the replica starts.
- As mentioned above, there should be backing image manager pods managing all backing images.
  - One backing image manager pod per disk. If there is no disk on a node, there is no need to launch a manager pod. In other words, this is similar to the replica instance manager.
  - A gRPC service will be launched in order to communicate with longhorn managers (a Go sketch of this interface follows this list).
    - These operations should be considered: `Pull` (download from remote), `Sync` (request a backing image file from other manager pods), `Send` (send a backing image file to other manager pods), `Watch` (notify the manager that the status of a backing image is updated), and `VersionGet`.
    - The pulling/syncing progress should be calculated and reported to the manager.
    - An existing backing image file can be reused.
    - To notify the longhorn manager, gRPC streaming will be used for the API `Watch`.
  - A monitor goroutine will periodically check all backing image files.
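To make the service boundary more concrete, here is a minimal Go sketch of what such a gRPC-facing interface could look like. The type and method names (`BackingImageInfo`, `BackingImageService`) are illustrative assumptions, not the actual proto definitions used by the backing-image-manager.

```go
package sketch

import "context"

// BackingImageInfo is a hypothetical status snapshot of one backing image file
// on a disk, mirroring the fields discussed in this proposal.
type BackingImageInfo struct {
	Name     string
	UUID     string
	Size     int64
	State    string // pending / downloading / downloaded / failed / unknown
	Progress int    // pulling/syncing progress in percent
}

// BackingImageService sketches the operations a backing image manager pod
// would expose over gRPC; the real proto definitions may differ.
type BackingImageService interface {
	// Pull downloads the file from a remote URL into the disk's work directory.
	Pull(ctx context.Context, name, uuid, url string) (*BackingImageInfo, error)
	// Sync asks another manager to send an existing file to this manager.
	Sync(ctx context.Context, name, uuid, fromAddress string) (*BackingImageInfo, error)
	// Send streams a downloaded file to a receiving manager.
	Send(ctx context.Context, name, toAddress string) error
	// Delete unregisters the image and removes its work directory.
	Delete(ctx context.Context, name string) error
	// Get and List collect the status of one or all backing images.
	Get(ctx context.Context, name string) (*BackingImageInfo, error)
	List(ctx context.Context) (map[string]*BackingImageInfo, error)
	// Watch notifies the caller (longhorn-manager) via a stream whenever
	// any backing image status changes.
	Watch(ctx context.Context) (<-chan *BackingImageInfo, error)
	// VersionGet reports the manager's API version for compatibility checks.
	VersionGet(ctx context.Context) (string, error)
}
```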
User Stories
Rebuild replica for a large volume after network fluctuation/node reboot
Before the enhancement, users need to manually copy the backing image data to the volume in advance.
After the enhancement, users can directly specify the backing image during volume creation/restore with a click. And one backing image can be shared among all replicas in the same disk.
User Experience In Detail
- Users can modify the backing image cleanup timeout setting so that all unused backing images will be cleaned up from disks automatically.
- Create a volume with a backing image:
  2.1. via Longhorn UI
    1. Users add a backing image, which is similar to adding an engine image or setting up the backup target in the system.
    2. Users create/restore a volume with the backing image specified from the backing image list.
  2.2. via CSI (StorageClass)
    1. Users specify `backingImageName` and `backingImageAddress` in a StorageClass (see the sketch after this list).
    2. Users use this StorageClass to create a PVC. When the PVC is created, Longhorn will automatically create a volume as well as the backing image if it does not exist.
- Users attach the volume to a node (via GUI or Kubernetes). Longhorn will automatically download the related backing image to the disks the volume replicas are using. In brief, users don't need to do anything more for the backing image.
- When users back up a volume with a backing image, the backing image info will be recorded in the backup but the actual backing image data won't be uploaded to the backupstore. Instead, the backing image will be re-downloaded from the original image URL once it's required.
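For illustration, here is a rough sketch of a StorageClass carrying the two backing image parameters mentioned above, written with the Kubernetes Go API types rather than YAML. The parameter keys come from this proposal; the image name, URL, and StorageClass name are placeholder assumptions.

```go
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backingImageStorageClass is a hypothetical StorageClass that asks Longhorn to
// create the backing image (if missing) for every volume provisioned from it.
var backingImageStorageClass = &storagev1.StorageClass{
	ObjectMeta:  metav1.ObjectMeta{Name: "longhorn-with-backing-image"},
	Provisioner: "driver.longhorn.io",
	Parameters: map[string]string{
		"numberOfReplicas":    "3",
		"backingImageName":    "example-ubuntu-image",                    // placeholder name
		"backingImageAddress": "https://images.example.com/ubuntu.qcow2", // placeholder URL
	},
}
```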
API Changes
- A bunch of RESTful APIs is required for the new CRD "backing image": `Create`, `Delete`, `List`, and `BackingImageCleanup`.
- Now the volume creation API receives the parameter `BackingImage`.
Design
Implementation Overview
longhorn-manager:
- In settings:
  - Add a setting `Backing Image Cleanup Wait Interval`.
  - Add a read-only setting `Default Backing Image Manager Image`.
- Add a new CRD `backingimages.longhorn.io`.
  - Field `Spec.Disks` records the disks that the backing image needs to be downloaded to.
  - Field `Status.DiskStatusMap` is designed to reflect the actual download status for the disks. The field `BackingImageDownloadState` is the value type of the map `Status.DiskStatusMap`. It can be `downloaded`/`downloading`/`pending`/`failed`/`unknown`.
  - Field `Status.DiskDownloadProgressMap` will report the pulling/syncing progress for downloading files.
  - Field `Status.UUID` should be generated and stored in ETCD before other operations. Considering users may create a new backing image with the same name but a different URL after deleting an old backing image, to avoid a possible leftover of the old backing image disturbing the new one, the manager can use the UUID to generate the work directory.
- Add a new CRD `backingimagemanagers.longhorn.io` (a rough Go sketch of both CRDs follows this list item).
  - Field `Spec.BackingImages` records which backing images should be downloaded by the manager. The key is the backing image name, the value is the backing image UUID.
  - Field `Status.BackingImageFileMap` will be updated according to the actual file status reported by the related manager pod.
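For orientation, here is a rough Go sketch of how these two CRDs could be shaped, using only the fields enumerated above. Field names not listed in this proposal (`ImageURL`, `NodeID`, `DiskUUID`, `DiskPath`, `BackingImageFileInfo`) are assumptions; the actual type definitions in longhorn-manager may name and organize things differently.

```go
package sketch

// BackingImageDownloadState mirrors the per-disk states listed above.
type BackingImageDownloadState string

const (
	BackingImageDownloadStatePending     BackingImageDownloadState = "pending"
	BackingImageDownloadStateDownloading BackingImageDownloadState = "downloading"
	BackingImageDownloadStateDownloaded  BackingImageDownloadState = "downloaded"
	BackingImageDownloadStateFailed      BackingImageDownloadState = "failed"
	BackingImageDownloadStateUnknown     BackingImageDownloadState = "unknown"
)

// BackingImageSpec/BackingImageStatus sketch the backingimages.longhorn.io CRD.
type BackingImageSpec struct {
	ImageURL string              `json:"imageURL"` // assumed field name for the download URL
	Disks    map[string]struct{} `json:"disks"`    // disk UUID -> the image should exist on this disk
}

type BackingImageStatus struct {
	UUID                    string                               `json:"uuid"`
	Size                    int64                                `json:"size"`
	DiskStatusMap           map[string]BackingImageDownloadState `json:"diskStatusMap"`
	DiskDownloadProgressMap map[string]int                       `json:"diskDownloadProgressMap"`
	DiskLastRefAtMap        map[string]string                    `json:"diskLastRefAtMap"`
}

// BackingImageManagerSpec/Status sketch the backingimagemanagers.longhorn.io CRD.
type BackingImageManagerSpec struct {
	Image         string            `json:"image"`    // backing image manager pod image
	NodeID        string            `json:"nodeID"`   // assumed: which node this manager serves
	DiskUUID      string            `json:"diskUUID"` // assumed: disk info for this manager
	DiskPath      string            `json:"diskPath"`
	BackingImages map[string]string `json:"backingImages"` // backing image name -> UUID
}

type BackingImageManagerStatus struct {
	BackingImageFileMap map[string]BackingImageFileInfo `json:"backingImageFileMap"`
}

// BackingImageFileInfo is a hypothetical per-file status entry.
type BackingImageFileInfo struct {
	Name     string                    `json:"name"`
	UUID     string                    `json:"uuid"`
	Size     int64                     `json:"size"`
	State    BackingImageDownloadState `json:"state"`
	Progress int                       `json:"downloadProgress"`
}
```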
- Add a new controller `BackingImageManagerController`.
  - Important notices:
    - Need to consider 2 kinds of managers: the default manager and old managers (this includes all incompatible managers).
      - All old managers will be removed immediately once the default backing image manager image is updated. And old managers shouldn't operate on any backing image files.
        - When an old manager is removed, the files handled by this manager won't be removed. All backing image requests will be taken over by the corresponding new managers. By disabling file operations in old managers, conflicts with the default manager won't happen.
        - Then the controller can directly delete old backing image managers without affecting existing backing images. This simplifies the cleanup flow. And new managers will take over all existing and required backing image files via the reuse mechanism.
      - Ideally there should be a cleanup mechanism that is responsible for removing all failed backing images as well as the images no longer required by the new backing image managers. But due to lack of time, it will be implemented in the future.
      - For default managers, the controller will directly send pull or sync requests to the new managers for all required backing images. If the files are already downloaded by the old managers, the files can be directly reused. This is actually a live upgrade for the backing image managers.
    - In most cases, the controller and the backing image manager will avoid deleting backing image files:
      - For example, if the pod crashes or one image file becomes failed, the controller will directly restart the pod or re-download the image, rather than cleaning up the files.
      - The controller will delete image files in only 2 cases: a backing image is no longer valid; a default backing image manager CR is deleted.
      - By following this strategy, we risk leaving some unused backing image files behind in some corner cases. However, the gain is a lower probability of crashing a replica due to backing image file deletion. Besides, the existing files can be reused after recovery. And after introducing the cleanup mechanism, we shouldn't need to worry about the leftovers anymore.
    - The pod not running doesn't mean all files handled by the pod become invalid. All files can be reused/re-monitored after the pod restarts.
  - Workflow:
    - If the deletion timestamp is set, the controller will clean up files for running default backing image managers only. Then it will blindly delete the related pods.
    - If there is no ready disk or node based on the disk & node info in the backing image manager spec, the current manager will be marked as `unknown`. Then all not-failed backing images are considered as `unknown` as well.
      - Actually there are multiple subcases here: node down, node reboot, node disconnection, disk detachment, longhorn manager pod missing, etc. It's complicated to distinguish all subcases and do something specific for each. Hence, I choose to simply mark the state as `unknown`.
    - Create backing image manager pods:
      - If the old status is `running` but the pod is not ready now, there must be something wrong with the manager pod. Hence the controller needs to update the state to `error`.
      - When the pod is ready, considering the case that the pod creation may succeed but the CR status update may fail due to conflicts, the controller won't check the previous state. Instead, it will directly update the state to `running`.
      - Start a monitor goroutine for each running pod.
      - If the manager is in state `error`, the controller will do cleanup then recreate the pod.
    - Handle files based on the spec:
      - Delete invalid backing images:
        - The backing image is no longer in `BackingImageManager.Spec.BackingImages`.
        - The backing image UUID doesn't match.
      - Download backing images for default managers (see the sketch after this controller's workflow):
        - If there is no existing file in any running manager pods (including pods not using the default image), the controller will sort all available default managers, then send a `Pull` call to the first manager. This means that the file will be downloaded only once among all manager pods in most cases. Notice that it's best-effort rather than guaranteed.
        - Otherwise, the current manager will try to fetch the file from other managers:
          - If the 1st file is still being downloaded, do nothing.
          - Each manager can send a downloaded backing image file to at most 3 other managers simultaneously. When there is no available sender, do nothing.
    - For the monitor goroutine, it's similar to that in InstanceManagerController.
      - It will `List` all backing image files once it receives a notification from the streaming.
      - If there are 10 continuous errors returned by the streaming receive function, the monitor goroutine will stop itself. Then the controller will restart it.
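A simplified Go sketch of the pull-or-sync decision described above. The types (`manager`, `fileState`) and the function are hypothetical stand-ins for the real controller code and CR statuses; they only illustrate the selection logic.

```go
package sketch

import "sort"

// fileState is a hypothetical view of one backing image file in a manager's status.
type fileState struct {
	State        string // "downloaded", "downloading", "failed", ...
	SendingCount int    // how many receivers this manager is currently sending to
}

// manager is a hypothetical view of a running backing image manager pod.
type manager struct {
	Name      string
	IsDefault bool
	Files     map[string]fileState // backing image name -> file state
}

const maxConcurrentSends = 3 // each manager sends to at most 3 receivers at once

// decideDownloadAction mirrors the controller logic above: if another manager
// already holds the file and has sending capacity, sync from it; if the first
// copy is still downloading or no sender is free, wait; otherwise let the first
// (sorted) default manager pull from the remote URL. It returns the action and
// the manager involved: ("sync", sender), ("pull", puller), or ("wait", "").
func decideDownloadAction(image string, managers []manager) (action, target string) {
	var senders []string
	downloading := false
	for _, m := range managers {
		f, exists := m.Files[image]
		if !exists {
			continue
		}
		switch f.State {
		case "downloaded":
			if f.SendingCount < maxConcurrentSends {
				senders = append(senders, m.Name)
			}
		case "downloading":
			downloading = true
		}
	}
	if len(senders) > 0 {
		sort.Strings(senders)
		return "sync", senders[0]
	}
	if downloading {
		return "wait", ""
	}
	// No manager holds or is fetching the file yet: the first sorted default
	// manager should pull it from the remote URL.
	var defaults []string
	for _, m := range managers {
		if m.IsDefault {
			defaults = append(defaults, m.Name)
		}
	}
	if len(defaults) == 0 {
		return "wait", ""
	}
	sort.Strings(defaults)
	return "pull", defaults[0]
}
```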
- Add a new controller `BackingImageController`.
  - Important notices:
    - The main responsibility of this controller is creating, deleting, and updating backing image managers. There is no gRPC call to the related backing image manager pods in this controller.
    - Besides recording the immutable UUID, the backing image status is used to record the file info from the manager status and present it to users.
    - Always try to create default backing image managers if they don't exist.
    - Aggressively delete non-default backing image managers.
  - Workflow:
    - If the deletion timestamp is set, the controller needs to do cleanup for all related backing image managers.
    - Generate a UUID for each new backing image. Make sure the UUID is stored in ETCD before doing anything else.
    - Init the maps in the backing image status.
    - Handle the backing image manager lifecycle:
      - Remove records in `Spec.BackingImages` or directly delete the manager CR.
      - Add records to `Spec.BackingImages` for the current backing image. Create backing image manager CRs with the default image if they don't exist.
    - Sync download status/info with the backing image manager status:
      - Blindly update `Status.DiskDownloadStateMap` and `Status.DiskDownloadProgressMap`.
      - Set `Status.Size` if it's 0. If somehow the size is not the same among all backing image managers, this means there is an unknown bug. Currently there is no way to automatically recover from it since Longhorn doesn't know which backing image manager holds the correct file.
    - Set the timestamp in `Status.DiskLastRefAtMap` if there is no replica using the backing image in a disk. Later the NodeController will do cleanup for `Spec.DiskDownloadMap` based on the timestamp.
- In Replica Controller:
  - Request downloading the image into a disk if a backing image used by a replica doesn't exist there.
  - Check and wait for the backing image disk map in the status before sending requests to replica instance managers.
- In Node Controller:
  - Determine if the backing image file in a disk needs to be cleaned up by checking the backing image `Status.DiskLastRefAtMap` and the wait interval `BackingImageCleanupWaitInterval` (see the sketch below this list).
  - Update the spec for backing image managers when there is a disk migration.
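A minimal sketch of that cleanup decision, assuming the per-disk timestamp is stored as an RFC3339 string and the setting has already been read as a duration; the helper name is hypothetical.

```go
package sketch

import "time"

// cleanupRequired decides whether the backing image file in one disk can be
// removed: there must be a "no replica is using it" timestamp recorded for the
// disk, and the configured wait interval must have elapsed since then.
func cleanupRequired(diskLastRefAtMap map[string]string, diskUUID string, waitInterval time.Duration, now time.Time) bool {
	lastRefAt, exists := diskLastRefAtMap[diskUUID]
	if !exists {
		// A replica on this disk still references the backing image.
		return false
	}
	t, err := time.Parse(time.RFC3339, lastRefAt)
	if err != nil {
		return false
	}
	return now.Sub(t) >= waitInterval
}
```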
- For the API volume creation:
  - Longhorn needs to verify the backing image if it's specified.
  - For restore/DR volumes, the backing image name stored in the backup volume will be used automatically if users do not specify the backing image name.
- In CSI:
  - Check the backing image info during the volume creation.
  - The missing backing image will be created when both the backing image name and the address are provided.
longhorn-engine:
- Verify the existing implementation and the related integration tests.
- Add raw backing file support.
- Update the backing file info for replicas when a replica is created/opened.
backing-image-manager:
- As mentioned above, the backing image UUID is used to generate the work directory for each backing image. The work directory looks like:

      <Disk path in container>/backing-images/
      <Disk path in container>/backing-images/<Downloading backing image1 name>-<Downloading backing image1 UUID>/backing.tmp
      <Disk path in container>/backing-images/<Downloaded backing image1 name>-<Downloaded backing image UUID>/backing
      <Disk path in container>/backing-images/<Downloaded backing image1 name>-<Downloaded backing image UUID>/backing.cfg
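For illustration, a small Go helper (hypothetical, not the actual backing-image-manager code) that derives these paths from the disk path, image name, and UUID:

```go
package sketch

import "path/filepath"

// backingImagePaths builds the work-directory layout for one backing image,
// derived from the disk path in the container plus "<image name>-<image UUID>",
// as described above.
func backingImagePaths(diskPath, imageName, imageUUID string) (workDir, tmpFile, imageFile, cfgFile string) {
	workDir = filepath.Join(diskPath, "backing-images", imageName+"-"+imageUUID)
	tmpFile = filepath.Join(workDir, "backing.tmp") // file being downloaded
	imageFile = filepath.Join(workDir, "backing")   // finished, renamed file
	cfgFile = filepath.Join(workDir, "backing.cfg") // recorded meta info/status
	return workDir, tmpFile, imageFile, cfgFile
}
```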
- There is a goroutine that periodically checks file existence based on the image file's current state.
- It will verify the disk UUID in the disk config file. If there is a mismatch, it will stop checking existing files. And the callers, the longhorn manager pods, won't send requests since this backing image manager is marked as unknown.
- The manager will provide one channel for all backing images. If there is an update to a backing image, the image will send a signal to the channel. Another goroutine receives from the channel and notifies the longhorn manager via the streaming (see the sketch below).
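A rough Go sketch of this update-notification flow, assuming a hypothetical `streamNotifier` interface as a stand-in for the real gRPC `Watch` stream:

```go
package sketch

import "log"

// streamNotifier is a stand-in for the gRPC Watch streaming connection.
type streamNotifier interface {
	Notify(imageName string) error
}

type backingImageManager struct {
	updateCh chan string // one channel shared by all backing images on this disk
}

// markImageUpdated is called whenever a backing image's state or progress changes.
func (m *backingImageManager) markImageUpdated(imageName string) {
	select {
	case m.updateCh <- imageName:
	default:
		// Channel full: a pending notification already covers this update.
	}
}

// forwardUpdates runs in its own goroutine and relays every signal to the
// longhorn-manager via the Watch streaming connection.
func (m *backingImageManager) forwardUpdates(stream streamNotifier) {
	for name := range m.updateCh {
		if err := stream.Notify(name); err != nil {
			log.Println("failed to notify watcher:", err)
		}
	}
}
```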
- Launch a gRPC service with the following APIs (see the download-flow sketch after this list):
  - API `Pull`: Register the image, then download the file from a URL. For a failed backing image, the manager will re-register then re-pull it.
    - Before starting the download, the image will check if there are existing files in the current work directory. If the files exist and the info in the cfg file matches the current status, the file will be directly reused and the actual pulling will be skipped.
    - Otherwise, the work directory will be cleaned up and recreated.
    - As the 1st step of starting the download, a cancellable context will be created. Then the image will use an HTTP request with this context to download the file. When the image is removed during downloading, or the download gets stuck for a while (the timeout is 4s for now), we can directly cancel the context to stop the download.
    - The backing image manager will wait at most 30s for the download to start. If the time is exceeded, the backing image will be marked as failed.
    - The file being downloaded is named `backing.tmp`. Once the download completes, the file will be renamed to `backing`, the meta info/status will be recorded in the config file `backing.cfg`, and the state will be updated.
    - Each time the image downloads a chunk of data, the progress will be updated. The first progress update means the download has started, and the state will be updated to `downloading`.
  - API `Sync`: Register the image, start a receiving server, and ask another manager to send the file via API `Send`. For a failed backing image, the manager will re-register then re-sync it. This should be similar to replica rebuilding.
    - Similar to `Pull`, the image will try to reuse existing files.
    - The manager is responsible for managing all ports. The image will use the functions provided by the manager to get and then release ports.
  - API `Send`: Send a backing image file to a receiver. This should be similar to replica rebuilding.
  - API `Delete`: Unregister the image then delete the image work directory. Make sure any in-progress syncing or pulling is cancelled.
  - API `Get`/`List`: Collect the status of one backing image/all backing images.
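A condensed Go sketch of the `Pull` download flow (cancellable context, temporary file, rename on completion). The function name is hypothetical, and the progress reporting, config-file writing, and the 30s/4s watchdogs are trimmed for brevity.

```go
package sketch

import (
	"context"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// pullBackingImage downloads the image into <workDir>/backing.tmp using a
// cancellable context, then renames it to <workDir>/backing on success.
func pullBackingImage(parent context.Context, workDir, url string) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // cancelling the context aborts the in-flight HTTP download

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	tmpPath := filepath.Join(workDir, "backing.tmp")
	tmp, err := os.Create(tmpPath)
	if err != nil {
		return err
	}

	// Copy chunk by chunk; a real implementation would update the download
	// progress here and cancel ctx if no data arrives within ~4s.
	_, copyErr := io.Copy(tmp, resp.Body)
	closeErr := tmp.Close()
	if copyErr != nil {
		return copyErr
	}
	if closeErr != nil {
		return closeErr
	}

	// Rename the finished download to the final file name; the meta info
	// would then be written to backing.cfg and the state set to downloaded.
	return os.Rename(tmpPath, filepath.Join(workDir, "backing"))
}
```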
longhorn-ui:
- Launch a new page to present and operate backing images.
- Show the image (download) status for each disk based on `Status.DiskStatusMap` and `Status.DiskDownloadProgressMap` when users click `Detail` for one backing image.
  - If the state is `downloading`, the progress will be presented as well.
- Add the following operation list:
  - `Create`: The required field is `image`.
  - `Delete`: No field is required. It should be disabled when there is one replica using the backing image.
  - `CleanupDiskImages`: This allows users to manually clean up the images in some disks in advance. It's a batch operation.
- Allow choosing a backing image for volume creation.
- Allow choosing/re-specifying a new backing image for restore/DR volume creation:
  - If there is backing image info in the backup volume, an option `Use previous backing image` will be shown and checked by default.
  - If the option is unchecked by users, the UI will show the backing image list so that users can pick one.
Test Plan
Integration tests
- Backing image basic operation
- Backing image auto cleanup
- Backing image with disk migration
Manual tests
- The backing image on a down node
- The backing image works fine with system upgrade & backing image manager upgrade
- The incompatible backing image manager handling
Upgrade strategy
N/A