feat(lep): concurrent backup restore design

Ref: 4558 Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>
2022-12-05 11:19:42 +08:00 · 2022-12-05 11:19:42 +08:00 · d57db31395
commit d57db31395
parent 986c4b96b0
1 changed files with 78 additions and 0 deletions
--- a/enhancements/20221205-concurrent-backup-restore-limit.md
+++ b/enhancements/20221205-concurrent-backup-restore-limit.md
@ -0,0 +1,78 @@
+# Concurrent Backup Restore Per Node Limit
+
+## Summary
+Longhorn has no boundary on the number of concurrent volume backup restoring.
+
+Having a new `concurrent-backup-restore-per-node-limit` setting allows the user to limit the concurring backup restoring. Setting this restriction lowers the potential risk of overloading the cluster when volumes restoring from backup concurrently. For ex: during the Longhorn system restore.
+
+### Related Issues
+https://github.com/longhorn/longhorn/issues/4558
+
+## Motivation
+### Goals
+Introduce a new `concurrent-backup-restore-per-node-limit` setting to define the boundary of the concurrent volume backup restoring.
+
+### Non-goals
+`None`
+
+## Proposal
+1. Introduce a new `concurrent-backup-restore-per-node-limit` setting.
+1. Track the number of per-node volumes restoring from backup with atomic count (thread-safe) in the engine monitor.
+
+### User Stories
+Allow the user to set the concurrent backup restore per node limit to control the risk of cluster overload when Longhorn volume is restoring from backup concurrently.
+
+### User Experience In Detail
+1. Longhorn holds the engine backup restore when the number of volume backups restoring on a node reaches the `concurrent-backup-restore-per-node-limit`.
+1. The volume backup restore continues when the number of volume backups restoring on a node is below the limit.
+
+## Design
+
+### Implementation Overview
+
+#### The `concurrent-backup-restore-per-node-limit` Setting
+
+This setting controls how many engines on a node can restore the backup concurrently.
+
+Longhorn engine monitor backs off when the volume [backup restoring count](#track-the-volume-backup-restoring-per-node) reaches the setting limit.
+
+Set the value to **0** to disable backup restore.
+
+```
+Category = SettingCategoryGeneral,
+Type     = integer
+Default  = 5  # same as the default replica rebuilding number
+```
+
+#### Track the volume backup restoring per node
+
+1. Create a new atomic counter in the engine controller.
+   ```
+   type EngineController struct {
+      restoringCounter util.Counter
+   }
+   ```
+1. Pass the restoring counter to each of its engine monitors.
+   ```
+   type EngineMonitor struct {
+      restoringCounter util.Counter
+   }
+   ```
+
+1. Increase the restoring counter before backup restore.
+   > Ignore DR volumes (volume.Status.IsStandby).
+1. Decrease the restoring counter when the backup restore caller method ends
+
+### Test plan
+
+- Test the setting should block backup restore when creating multiple volumes from the backup at the same time.
+- Test the setting should be per-node limited.
+- Test the setting should not have effect on DR volumes.
+
+### Upgrade strategy
+
+`None`
+
+## Note [optional]
+
+`None`