longhorn/enhancements/20221205-concurrent-backup-restore-limit.md
Chin-Ya Huang d57db31395 feat(lep): concurrent backup restore design
Ref: 4558

Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>
2022-12-08 22:37:40 +08:00

Concurrent Backup Restore Per Node Limit

Summary

Longhorn currently places no limit on the number of volumes that can restore from backup concurrently.

A new concurrent-backup-restore-per-node-limit setting allows the user to limit the number of concurrent backup restores on each node. Setting this limit lowers the risk of overloading the cluster when many volumes restore from backup at the same time, for example during a Longhorn system restore.

https://github.com/longhorn/longhorn/issues/4558

Motivation

Goals

Introduce a new concurrent-backup-restore-per-node-limit setting that caps the number of volumes restoring from backup concurrently on a node.

Non-goals

None

Proposal

  1. Introduce a new concurrent-backup-restore-per-node-limit setting.
  2. Track the number of volumes restoring from backup on each node with an atomic (thread-safe) counter in the engine monitor.

User Stories

Allow the user to set a per-node limit on concurrent backup restores to reduce the risk of cluster overload when multiple Longhorn volumes restore from backup concurrently.

User Experience In Detail

  1. Longhorn holds off a new engine backup restore when the number of volumes restoring from backup on the node has reached the concurrent-backup-restore-per-node-limit.
  2. The held backup restore proceeds once the number of volumes restoring from backup on the node drops below the limit.

Design

Implementation Overview

The concurrent-backup-restore-per-node-limit Setting

This setting controls how many engines on a node can restore the backup concurrently.

The Longhorn engine monitor backs off when the number of volumes restoring from backup on the node reaches the limit defined by this setting.

Set the value to 0 to disable backup restore.

Category = SettingCategoryGeneral
Type     = integer
Default  = 5  # same as the default replica rebuilding number
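
The back-off rule described above can be sketched as a simple comparison. This is a minimal illustration, not the Longhorn implementation: `shouldBackOff` and `defaultRestoreLimit` are hypothetical names, and the sketch assumes that a count equal to the limit already triggers the back-off, which also makes a limit of 0 block every restore.

```go
package main

import "fmt"

// defaultRestoreLimit mirrors the proposed default of the
// concurrent-backup-restore-per-node-limit setting (hypothetical name).
const defaultRestoreLimit = 5

// shouldBackOff reports whether a new backup restore has to wait, given the
// number of restores already running on the node. With a limit of 0 the
// condition is always true, so backup restore is effectively disabled.
func shouldBackOff(restoring, limit int) bool {
	return restoring >= limit
}

func main() {
	fmt.Println(shouldBackOff(4, defaultRestoreLimit)) // false: one slot left
	fmt.Println(shouldBackOff(5, defaultRestoreLimit)) // true: limit reached
	fmt.Println(shouldBackOff(0, 0))                   // true: restore disabled
}
```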

Track the volume backup restoring per node

  1. Create a new atomic counter in the engine controller.

    type EngineController struct {
        // restoringCounter counts the volumes restoring from backup on this node.
        restoringCounter util.Counter
    }
    
  2. Pass the restoring counter to each of its engine monitors.

    type EngineMonitor struct {
        // restoringCounter is shared with the engine controller that created this monitor.
        restoringCounter util.Counter
    }
    
  3. Increment the restoring counter before starting the backup restore.

    Ignore DR volumes (volume.Status.IsStandby); they are not counted against the limit.

  4. Decrement the restoring counter when the method that calls the backup restore returns.
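
The counting steps above can be sketched with the standard `sync/atomic` package. This is a sketch under stated assumptions rather than the actual engine monitor code: `restoreCounter`, `tryAcquire`, `release`, and `restore` are hypothetical names, and `restoreCounter` only stands in for `util.Counter`, whose method set is not shown in this document. The sketch combines the limit check and the increment in one compare-and-swap so that concurrent callers cannot exceed the limit.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// restoreCounter is a hypothetical thread-safe counter standing in for util.Counter.
type restoreCounter struct {
	count int32
}

// tryAcquire atomically reserves a restore slot if the count is below limit.
func (c *restoreCounter) tryAcquire(limit int32) bool {
	for {
		cur := atomic.LoadInt32(&c.count)
		if cur >= limit {
			return false
		}
		if atomic.CompareAndSwapInt32(&c.count, cur, cur+1) {
			return true
		}
	}
}

func (c *restoreCounter) release()   { atomic.AddInt32(&c.count, -1) }
func (c *restoreCounter) get() int32 { return atomic.LoadInt32(&c.count) }

// restore simulates the method that calls the backup restore. DR (standby)
// volumes skip the counter; other volumes back off when no slot is free.
func restore(c *restoreCounter, limit int32, isStandby bool, work func()) bool {
	if !isStandby {
		if !c.tryAcquire(limit) {
			return false // back off; the caller retries later
		}
		defer c.release() // decrement when this method returns
	}
	work()
	return true
}

func main() {
	const limit = 2
	c := &restoreCounter{}

	// Many goroutines race for slots; exactly `limit` of them succeed.
	var wg sync.WaitGroup
	var granted int32
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if c.tryAcquire(limit) {
				atomic.AddInt32(&granted, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt32(&granted), c.get()) // 2 2

	// At the limit, a regular volume backs off while a DR volume proceeds.
	fmt.Println(restore(c, limit, false, func() {})) // false
	fmt.Println(restore(c, limit, true, func() {}))  // true

	// Once a slot is released, the regular restore goes through.
	c.release()
	fmt.Println(restore(c, limit, false, func() {})) // true
	fmt.Println(c.get())                             // 1
}
```

Using `defer` for the decrement keeps the counter accurate on every return path of the caller, including early returns on error.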

Test plan

  • Test that the setting blocks backup restores beyond the limit when multiple volumes are created from backup at the same time.
  • Test that the limit is applied per node.
  • Test that the setting has no effect on DR volumes.

Upgrade strategy

None

Note [optional]

None