diff --git a/enhancements/20210111-upgrade-engine-automatically.md b/enhancements/20210111-upgrade-engine-automatically.md new file mode 100644 index 0000000..d4a02aa --- /dev/null +++ b/enhancements/20210111-upgrade-engine-automatically.md @@ -0,0 +1,173 @@ +# Upgrade Volume's Engine Image Automatically + +## Summary + +Currently, users need to upgrade the engine image manually in the UI after upgrading Longhorn. +We should provide an option. When it is enabled, automatically upgrade engine images for volumes when it is applicable. + +### Related Issues + +https://github.com/longhorn/longhorn/issues/2152 + +## Motivation + +### Goals + +Reduce the amount of manual work users have to do when upgrading Longhorn by automatically upgrade engine images +for volumes when they are ok to upgrade. E.g. when we can do engine live upgrade or when volume is in detaching state. + +## Proposal + +Add a new boolean setting, `Concurrent Automatic Engine Upgrade Per Node Limit`, so users can control how Longhorn upgrade engines. +The value of this setting specifies the maximum number of engines per node that are allowed to upgrade to the default engine image at the same time. +If the value is 0, Longhorn will not automatically upgrade volumes' engines to default version. + +After user upgrading Longhorn to a new version, there will be a new default engine image and possibly a new default instance manager image. + +The proposal has 2 parts: +1. In which component we will do the upgrade. +2. Identify when is it ok to upgrade engine for volume to the new default engine image. + +For part 1, we will do the upgrade inside `engine image controller`. +The controller will constantly watch and upgrade the engine images for volumes when it is ok to upgrade. + +For part 2, we upgrade engine image for a volume when the following conditions are met: +1. The new default engine image is ready +1. Volume is not upgrading engine image +1. The volume condition is one of the following: + 1. Volume is in detached state. + 1. Volume is in attached state (live upgrade). + And volume is healthy. + And The current volume's engine image is compatible with the new default engine image. + And If volume is not a DR volume. + And Volume is not expanding. + +### User Stories + +Before this enhancement, users have to manually upgrade engine images for volume after upgrading Longhorn system to a newer version. +If there are thoudsands of volumes in the system, this is a significant manual work. + +After this enhancement users either have to do nothing (in case live upgrade is possible) +or they only have to scale down/up the workload (in case there is a new default IM image) + +### User Experience In Detail + +1. User upgrade Longhorn to a newer version. +The new Longhorn version is compatible with the volume's current engine image. +Longhorn automatically do live engine image upgrade for volumes + +2. User upgrade Longhorn to a newer version. +The new Longhorn version is not compatible with the volume's current engine image. +Users only have to scale the workload down and up. +This experience is similar to restart Google Chrome to use a new version. + +3. Note that users need to disable this feature if they want to update the engine image to a specific version for the volumes. +If `Concurrent Automatic Engine Upgrade Per Node Limit` setting is bigger than 0, Longhorn will not allow user to manually upgrade engine to a version other than the default version. + +### API changes + +No API change is needed. + +## Design + +### Implementation Overview + +1. Inside `engine image controller` sync function, get the value of the setting `Concurrent Automatic Engine Upgrade Per Node Limit` and assign it to concurrentAutomaticEngineUpgradePerNodeLimit variable. + If concurrentAutomaticEngineUpgradePerNodeLimit <= 0, we skip upgrading. +1. Find the new default engine image. Check if the new default engine image is ready. If it is not we skip the upgrade. + +1. List all volumes in Longhorn system. + Select a set of volume candidate for upgrading. + We select candidates that has the condition is one of the following case: + 1. Volume is in detached state. + 1. Volume is in attached state (live upgrade). + And volume is healthy. + And Volume is not upgrading engine image. + And The current volume's engine image is compatible with the new default engine image. + And the volume is not a DR volume. + And volume is not expanding. + +1. Make sure not to upgrade too many volumes on the same node at the same time. + Filter the upgrading candidate set so that total number of upgrading volumes and candidates per node is not over `concurrentAutomaticEngineUpgradePerNodeLimit`. +1. For each volume candidate, set `v.Spec.EngineImage = new default engine image` to update the engine for the volume. +1. If the engine upgrade failed to complete (e.g. the v.Spec.EngineImage != v.Status.CurrentImage), + we just consider it is the same as volume is in upgrading process and skip it. + Volume controller will handle the reconciliation when it is possible. + +### Test plan + + +Integration test plan. + +Preparation: +1. set up a backup store +2. Deploy a compatible new engine image + +Case 1: Concurrent engine upgrade +1. Create 10 volumes each of 1Gb. +2. Attach 5 volumes vol-0 to vol-4. Write data to it +3. Upgrade all volumes to the new engine image +4. Wait until the upgrades are completed (volumes' engine image changed, + replicas' mode change to RW for attached volumes, reference count of the + new engine image changed, all engine and replicas' engine image changed) +5. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3 +6. In a retry loop, verify that the number of volumes who + is upgrading engine is always smaller or equal to 3 +7. Wait until the upgrades are completed (volumes' engine image changed, + replica mode change to RW for attached volumes, reference count of the + new engine image changed, all engine and replicas' engine image changed, + etc ...) +8. verify the volumes' data + +Case 2: Dr volume +1. Create a backup for vol-0. Create a DR volume from the backup +2. Try to upgrade the DR volume engine's image to the new engine image +3. Verify that the Longhorn API returns error. Upgrade fails. +4. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 0 +5. Try to upgrade the DR volume engine's image to the new engine image +6. Wait until the upgrade are completed (volumes' engine image changed, + replicas' mode change to RW, reference count of the new engine image + changed, engine and replicas' engine image changed) +7. Wait for the DR volume to finish restoring +8. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3 +9. In a 2-min retry loop, verify that Longhorn doesn't automatically + upgrade engine image for DR volume. + +Case 3: Expanding volume +1. set concurrent-automatic-engine-upgrade-per-node-limit setting to 0 +2. Upgrade vol-0 to the new engine image +3. Wait until the upgrade are completed (volumes' engine image changed, + replicas' mode change to RW, reference count of the new engine image + changed, engine and replicas' engine image changed) +4. Detach vol-0 +5. Expand the vol-0 from 1Gb to 5GB +6. Wait for the vol-0 to start expanding +7. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3 +8. While vol-0 is expanding, verify that its engine is not upgraded to + the default engine image +9. Wait for the expansion to finish and vol-0 is detached +10. Verify that Longhorn upgrades vol-0's engine to the default version + +Case 4: Degraded volume +1. set concurrent-automatic-engine-upgrade-per-node-limit setting to 0 +2. Upgrade vol-1 (an healthy attached volume) to the new engine image +3. Wait until the upgrade are completed (volumes' engine image changed, + replicas' mode change to RW, reference count of the new engine image + changed, engine and replicas' engine image changed) +4. Increase number of replica count to 4 to make the volume degraded +5. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3 +6. In a 2-min retry loop, verify that Longhorn doesn't automatically + upgrade engine image for vol-1. + +Cleaning up: +1. Clean up volumes +2. Reset automatically-upgrade-engine-to-default-version setting in + the client fixture + +### Upgrade strategy + +No upgrade strategy is needed. + + +### Addtional Context +None \ No newline at end of file