feat(lep): concurrent backup restore design
Ref: 4558 Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>
This commit is contained in:
parent
986c4b96b0
commit
d57db31395
78
enhancements/20221205-concurrent-backup-restore-limit.md
Normal file
78
enhancements/20221205-concurrent-backup-restore-limit.md
Normal file
@ -0,0 +1,78 @@
|
|||||||
|
# Concurrent Backup Restore Per Node Limit
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
Longhorn has no boundary on the number of concurrent volume backup restoring.
|
||||||
|
|
||||||
|
Having a new `concurrent-backup-restore-per-node-limit` setting allows the user to limit the concurring backup restoring. Setting this restriction lowers the potential risk of overloading the cluster when volumes restoring from backup concurrently. For ex: during the Longhorn system restore.
|
||||||
|
|
||||||
|
### Related Issues
|
||||||
|
https://github.com/longhorn/longhorn/issues/4558
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
### Goals
|
||||||
|
Introduce a new `concurrent-backup-restore-per-node-limit` setting to define the boundary of the concurrent volume backup restoring.
|
||||||
|
|
||||||
|
### Non-goals
|
||||||
|
`None`
|
||||||
|
|
||||||
|
## Proposal
|
||||||
|
1. Introduce a new `concurrent-backup-restore-per-node-limit` setting.
|
||||||
|
1. Track the number of per-node volumes restoring from backup with atomic count (thread-safe) in the engine monitor.
|
||||||
|
|
||||||
|
### User Stories
|
||||||
|
Allow the user to set the concurrent backup restore per node limit to control the risk of cluster overload when Longhorn volume is restoring from backup concurrently.
|
||||||
|
|
||||||
|
### User Experience In Detail
|
||||||
|
1. Longhorn holds the engine backup restore when the number of volume backups restoring on a node reaches the `concurrent-backup-restore-per-node-limit`.
|
||||||
|
1. The volume backup restore continues when the number of volume backups restoring on a node is below the limit.
|
||||||
|
|
||||||
|
## Design
|
||||||
|
|
||||||
|
### Implementation Overview
|
||||||
|
|
||||||
|
#### The `concurrent-backup-restore-per-node-limit` Setting
|
||||||
|
|
||||||
|
This setting controls how many engines on a node can restore the backup concurrently.
|
||||||
|
|
||||||
|
Longhorn engine monitor backs off when the volume [backup restoring count](#track-the-volume-backup-restoring-per-node) reaches the setting limit.
|
||||||
|
|
||||||
|
Set the value to **0** to disable backup restore.
|
||||||
|
|
||||||
|
```
|
||||||
|
Category = SettingCategoryGeneral,
|
||||||
|
Type = integer
|
||||||
|
Default = 5 # same as the default replica rebuilding number
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Track the volume backup restoring per node
|
||||||
|
|
||||||
|
1. Create a new atomic counter in the engine controller.
|
||||||
|
```
|
||||||
|
type EngineController struct {
|
||||||
|
restoringCounter util.Counter
|
||||||
|
}
|
||||||
|
```
|
||||||
|
1. Pass the restoring counter to each of its engine monitors.
|
||||||
|
```
|
||||||
|
type EngineMonitor struct {
|
||||||
|
restoringCounter util.Counter
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
1. Increase the restoring counter before backup restore.
|
||||||
|
> Ignore DR volumes (volume.Status.IsStandby).
|
||||||
|
1. Decrease the restoring counter when the backup restore caller method ends
|
||||||
|
|
||||||
|
### Test plan
|
||||||
|
|
||||||
|
- Test the setting should block backup restore when creating multiple volumes from the backup at the same time.
|
||||||
|
- Test the setting should be per-node limited.
|
||||||
|
- Test the setting should not have effect on DR volumes.
|
||||||
|
|
||||||
|
### Upgrade strategy
|
||||||
|
|
||||||
|
`None`
|
||||||
|
|
||||||
|
## Note [optional]
|
||||||
|
|
||||||
|
`None`
|
Loading…
Reference in New Issue
Block a user