Add LEP for feature automatically upgrade engine for volumes
Longhorn #2152 Signed-off-by: Phan Le <phan.le@rancher.com>
This commit is contained in:
parent
0fd0c33e06
commit
ee812f00ee
173
enhancements/20210111-upgrade-engine-automatically.md
Normal file
173
enhancements/20210111-upgrade-engine-automatically.md
Normal file
@ -0,0 +1,173 @@
|
||||
# Upgrade Volume's Engine Image Automatically
|
||||
|
||||
## Summary
|
||||
|
||||
Currently, users need to upgrade the engine image manually in the UI after upgrading Longhorn.
|
||||
We should provide an option. When it is enabled, automatically upgrade engine images for volumes when it is applicable.
|
||||
|
||||
### Related Issues
|
||||
|
||||
https://github.com/longhorn/longhorn/issues/2152
|
||||
|
||||
## Motivation
|
||||
|
||||
### Goals
|
||||
|
||||
Reduce the amount of manual work users have to do when upgrading Longhorn by automatically upgrade engine images
|
||||
for volumes when they are ok to upgrade. E.g. when we can do engine live upgrade or when volume is in detaching state.
|
||||
|
||||
## Proposal
|
||||
|
||||
Add a new boolean setting, `Concurrent Automatic Engine Upgrade Per Node Limit`, so users can control how Longhorn upgrade engines.
|
||||
The value of this setting specifies the maximum number of engines per node that are allowed to upgrade to the default engine image at the same time.
|
||||
If the value is 0, Longhorn will not automatically upgrade volumes' engines to default version.
|
||||
|
||||
After user upgrading Longhorn to a new version, there will be a new default engine image and possibly a new default instance manager image.
|
||||
|
||||
The proposal has 2 parts:
|
||||
1. In which component we will do the upgrade.
|
||||
2. Identify when is it ok to upgrade engine for volume to the new default engine image.
|
||||
|
||||
For part 1, we will do the upgrade inside `engine image controller`.
|
||||
The controller will constantly watch and upgrade the engine images for volumes when it is ok to upgrade.
|
||||
|
||||
For part 2, we upgrade engine image for a volume when the following conditions are met:
|
||||
1. The new default engine image is ready
|
||||
1. Volume is not upgrading engine image
|
||||
1. The volume condition is one of the following:
|
||||
1. Volume is in detached state.
|
||||
1. Volume is in attached state (live upgrade).
|
||||
And volume is healthy.
|
||||
And The current volume's engine image is compatible with the new default engine image.
|
||||
And If volume is not a DR volume.
|
||||
And Volume is not expanding.
|
||||
|
||||
### User Stories
|
||||
|
||||
Before this enhancement, users have to manually upgrade engine images for volume after upgrading Longhorn system to a newer version.
|
||||
If there are thoudsands of volumes in the system, this is a significant manual work.
|
||||
|
||||
After this enhancement users either have to do nothing (in case live upgrade is possible)
|
||||
or they only have to scale down/up the workload (in case there is a new default IM image)
|
||||
|
||||
### User Experience In Detail
|
||||
|
||||
1. User upgrade Longhorn to a newer version.
|
||||
The new Longhorn version is compatible with the volume's current engine image.
|
||||
Longhorn automatically do live engine image upgrade for volumes
|
||||
|
||||
2. User upgrade Longhorn to a newer version.
|
||||
The new Longhorn version is not compatible with the volume's current engine image.
|
||||
Users only have to scale the workload down and up.
|
||||
This experience is similar to restart Google Chrome to use a new version.
|
||||
|
||||
3. Note that users need to disable this feature if they want to update the engine image to a specific version for the volumes.
|
||||
If `Concurrent Automatic Engine Upgrade Per Node Limit` setting is bigger than 0, Longhorn will not allow user to manually upgrade engine to a version other than the default version.
|
||||
|
||||
### API changes
|
||||
|
||||
No API change is needed.
|
||||
|
||||
## Design
|
||||
|
||||
### Implementation Overview
|
||||
|
||||
1. Inside `engine image controller` sync function, get the value of the setting `Concurrent Automatic Engine Upgrade Per Node Limit` and assign it to concurrentAutomaticEngineUpgradePerNodeLimit variable.
|
||||
If concurrentAutomaticEngineUpgradePerNodeLimit <= 0, we skip upgrading.
|
||||
1. Find the new default engine image. Check if the new default engine image is ready. If it is not we skip the upgrade.
|
||||
|
||||
1. List all volumes in Longhorn system.
|
||||
Select a set of volume candidate for upgrading.
|
||||
We select candidates that has the condition is one of the following case:
|
||||
1. Volume is in detached state.
|
||||
1. Volume is in attached state (live upgrade).
|
||||
And volume is healthy.
|
||||
And Volume is not upgrading engine image.
|
||||
And The current volume's engine image is compatible with the new default engine image.
|
||||
And the volume is not a DR volume.
|
||||
And volume is not expanding.
|
||||
|
||||
1. Make sure not to upgrade too many volumes on the same node at the same time.
|
||||
Filter the upgrading candidate set so that total number of upgrading volumes and candidates per node is not over `concurrentAutomaticEngineUpgradePerNodeLimit`.
|
||||
1. For each volume candidate, set `v.Spec.EngineImage = new default engine image` to update the engine for the volume.
|
||||
1. If the engine upgrade failed to complete (e.g. the v.Spec.EngineImage != v.Status.CurrentImage),
|
||||
we just consider it is the same as volume is in upgrading process and skip it.
|
||||
Volume controller will handle the reconciliation when it is possible.
|
||||
|
||||
### Test plan
|
||||
|
||||
|
||||
Integration test plan.
|
||||
|
||||
Preparation:
|
||||
1. set up a backup store
|
||||
2. Deploy a compatible new engine image
|
||||
|
||||
Case 1: Concurrent engine upgrade
|
||||
1. Create 10 volumes each of 1Gb.
|
||||
2. Attach 5 volumes vol-0 to vol-4. Write data to it
|
||||
3. Upgrade all volumes to the new engine image
|
||||
4. Wait until the upgrades are completed (volumes' engine image changed,
|
||||
replicas' mode change to RW for attached volumes, reference count of the
|
||||
new engine image changed, all engine and replicas' engine image changed)
|
||||
5. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3
|
||||
6. In a retry loop, verify that the number of volumes who
|
||||
is upgrading engine is always smaller or equal to 3
|
||||
7. Wait until the upgrades are completed (volumes' engine image changed,
|
||||
replica mode change to RW for attached volumes, reference count of the
|
||||
new engine image changed, all engine and replicas' engine image changed,
|
||||
etc ...)
|
||||
8. verify the volumes' data
|
||||
|
||||
Case 2: Dr volume
|
||||
1. Create a backup for vol-0. Create a DR volume from the backup
|
||||
2. Try to upgrade the DR volume engine's image to the new engine image
|
||||
3. Verify that the Longhorn API returns error. Upgrade fails.
|
||||
4. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 0
|
||||
5. Try to upgrade the DR volume engine's image to the new engine image
|
||||
6. Wait until the upgrade are completed (volumes' engine image changed,
|
||||
replicas' mode change to RW, reference count of the new engine image
|
||||
changed, engine and replicas' engine image changed)
|
||||
7. Wait for the DR volume to finish restoring
|
||||
8. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3
|
||||
9. In a 2-min retry loop, verify that Longhorn doesn't automatically
|
||||
upgrade engine image for DR volume.
|
||||
|
||||
Case 3: Expanding volume
|
||||
1. set concurrent-automatic-engine-upgrade-per-node-limit setting to 0
|
||||
2. Upgrade vol-0 to the new engine image
|
||||
3. Wait until the upgrade are completed (volumes' engine image changed,
|
||||
replicas' mode change to RW, reference count of the new engine image
|
||||
changed, engine and replicas' engine image changed)
|
||||
4. Detach vol-0
|
||||
5. Expand the vol-0 from 1Gb to 5GB
|
||||
6. Wait for the vol-0 to start expanding
|
||||
7. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3
|
||||
8. While vol-0 is expanding, verify that its engine is not upgraded to
|
||||
the default engine image
|
||||
9. Wait for the expansion to finish and vol-0 is detached
|
||||
10. Verify that Longhorn upgrades vol-0's engine to the default version
|
||||
|
||||
Case 4: Degraded volume
|
||||
1. set concurrent-automatic-engine-upgrade-per-node-limit setting to 0
|
||||
2. Upgrade vol-1 (an healthy attached volume) to the new engine image
|
||||
3. Wait until the upgrade are completed (volumes' engine image changed,
|
||||
replicas' mode change to RW, reference count of the new engine image
|
||||
changed, engine and replicas' engine image changed)
|
||||
4. Increase number of replica count to 4 to make the volume degraded
|
||||
5. Set concurrent-automatic-engine-upgrade-per-node-limit setting to 3
|
||||
6. In a 2-min retry loop, verify that Longhorn doesn't automatically
|
||||
upgrade engine image for vol-1.
|
||||
|
||||
Cleaning up:
|
||||
1. Clean up volumes
|
||||
2. Reset automatically-upgrade-engine-to-default-version setting in
|
||||
the client fixture
|
||||
|
||||
### Upgrade strategy
|
||||
|
||||
No upgrade strategy is needed.
|
||||
|
||||
|
||||
### Addtional Context
|
||||
None
|
Loading…
Reference in New Issue
Block a user