
Disk Anti-Affinity

Summary

Longhorn supports multiple disks per node, but there is currently no way to ensure that two replicas for the same volume that schedule to the same node end up on different disks. In fact, the replica scheduler currently doesn't make any attempt to achieve this goal, even when it is possible to do so.

With the addition of a Disk Anti-Affinity feature, the Longhorn replica scheduler will attempt to schedule two replicas for the same volume to different disks when possible. Optionally, the scheduler will refuse to schedule a replica to a disk that has another replica for the same volume.

Although the comparison is not perfect, this enhancement can be thought of as enabling RAID 1 for Longhorn (mirroring across multiple disks on the same node).

See the Motivation section for potential benefits.

Replica Node Level Soft Anti-Affinity

Disabled by default. When disabled, prevents the scheduling of a replica to a node with an existing healthy replica of the same volume.

Can also be set at the volume level to override the global default.

Replica Zone Level Soft Anti-Affinity

Enabled by default. When disabled, prevents the scheduling of a replica to a zone with an existing healthy replica of the same volume.

Can also be set at the volume level to override the global default.

Motivation

  • Large, multi-node clusters will likely not benefit from this enhancement.
  • Single-node clusters and small, multi-node clusters (on which the number of replicas per volume exceeds the number of available nodes) will experience:
    • Increased data durability. If a single disk fails, a healthy replica will still exist on a disk that has not failed.
    • Increased data availability. If a single disk on a node becomes unavailable, but the node itself remains healthy, at least one replica remains healthy. On a single-node cluster, this can directly prevent a volume crash. On a small, multi-node cluster, this can prevent a future volume crash due to the loss of a different node.

Goals

  • In all situations, the Longhorn replica scheduler will make a best effort to ensure two replicas for the same volume do not schedule to the same disk.
  • Optionally, the scheduler will refuse to schedule a replica to a disk that has another replica of the same volume.

Proposal

User Stories

Story 1

My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks as desired replicas, I want Longhorn to do the best it can.

Story 2

My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks as desired replicas, I want scheduling to fail obviously. It is important that I know my volumes aren't being protected so I can take action.

Story 3

My cluster consists of a single node with multiple attached SSDs. When I create a specific, high-priority volume, I want replicas to distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks as desired replicas, I want scheduling to fail obviously. It is important that I know this high-priority volume isn't being protected so I can take action.

User Experience In Detail

API changes

Introduce a new Replica Disk Level Soft Anti-Affinity setting with the following definition. By default, set it to true. While it is generally desirable to schedule replicas to different disks, it would break with existing behavior to refuse to schedule replicas when different disks are not available.

SettingDefinitionReplicaDiskSoftAntiAffinity = SettingDefinition{
    DisplayName: "Replica Disk Level Soft Anti-Affinity",
    Description: "Allow scheduling on disks with existing healthy replicas of the same volume",
    Category:    SettingCategoryScheduling,
    Type:        SettingTypeBool,
    Required:    true,
    ReadOnly:    false,
    Default:     "true",
}
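
As a usage sketch, an administrator could later opt in to hard disk anti-affinity cluster-wide by changing the setting's value. The kebab-case setting name below is an assumption modeled on the naming of existing settings, not a confirmed identifier.

apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: replica-disk-soft-anti-affinity  # assumed setting name
  namespace: longhorn-system
value: "false"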

Introduce a new spec.replicaDiskSoftAntiAffinity volume field. By default, set it to ignored. Similar to the existing spec.replicaSoftAntiAffinity and spec.replicaZoneSoftAntiAffinity fields, override the global setting if this field is set to enabled or disabled.

replicaDiskSoftAntiAffinity:
  description: Replica disk soft anti-affinity of the volume. Set enabled
    to allow replicas to be scheduled on the same disk.
  enum:
  - ignored
  - enabled
  - disabled
  type: string
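
For illustration, a high-priority volume (as in Story 3) could enforce hard disk anti-affinity regardless of the global setting by setting the field to disabled. The manifest below is a sketch: the volume name is hypothetical, and only the fields relevant to this proposal are shown.

apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: high-priority-vol        # hypothetical name
  namespace: longhorn-system
spec:
  numberOfReplicas: 3
  replicaSoftAntiAffinity: enabled      # replicas may share a node...
  replicaDiskSoftAntiAffinity: disabled # ...but must not share a disk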

Design

Implementation Overview

The current replica scheduler does the following:

  1. Determines which nodes a replica can be scheduled to based on node condition and the ReplicaSoftAntiAffinity and ReplicaZoneSoftAntiAffinity settings.
  2. Creates a list of all schedulable disks on these nodes.
  3. Chooses the disk with the most available space for scheduling.

Add a step so that the replica scheduler:

  1. Determines which nodes a replica can be scheduled to based on node condition and the ReplicaSoftAntiAffinity and ReplicaZoneSoftAntiAffinity settings.
  2. Creates a list of all schedulable disks on these nodes.
  3. Filters the list to include only disks with the fewest existing replicas of the volume being scheduled and, when disk soft anti-affinity is disabled, only disks with no such replicas (see the sketch after this list).
  4. Chooses the disk from the filtered list with the most available space for scheduling.
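
A minimal sketch of the new filtering step in Go, using simplified placeholder types rather than the actual longhorn-manager structs; the function and field names here are illustrative assumptions.

package scheduler

// Disk is a simplified stand-in for a schedulable Longhorn disk.
type Disk struct {
	DiskUUID         string
	StorageAvailable int64
}

// Replica is a simplified stand-in for an existing replica of the volume.
type Replica struct {
	DiskID string
}

// filterDisksByReplicaCount keeps only the candidate disks that carry the
// fewest existing replicas of the volume. If allowSoft is false (hard disk
// anti-affinity), disks with any existing replica are rejected outright.
func filterDisksByReplicaCount(candidates map[string]*Disk, replicas []*Replica, allowSoft bool) map[string]*Disk {
	// Count the volume's existing replicas per disk.
	counts := map[string]int{}
	for _, r := range replicas {
		counts[r.DiskID]++
	}

	// Find the minimum replica count among the candidate disks.
	minCount := -1
	for uuid := range candidates {
		if c := counts[uuid]; minCount < 0 || c < minCount {
			minCount = c
		}
	}

	filtered := map[string]*Disk{}
	for uuid, disk := range candidates {
		c := counts[uuid]
		if c != minCount {
			continue // keep only the least-used disks
		}
		if !allowSoft && c > 0 {
			continue // hard anti-affinity: no existing replica allowed
		}
		filtered[uuid] = disk
	}

	// The caller then chooses the disk with the most available space from
	// this filtered set (step 4 above).
	return filtered
}

If soft anti-affinity is disabled and every candidate disk already holds a replica, the filtered set is empty and scheduling fails, which is exactly the hard anti-affinity behavior described above.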

Test plan

Minimally implement two new test cases:

  1. In a cluster that includes nodes with multiple available disks, create a volume with spec.replicaSoftAntiAffinity = enabled, spec.replicaDiskSoftAntiAffinity = enabled, and numberOfReplicas equal to the total number of disks in the cluster. Confirm that each replica schedules to a different disk. It may be necessary to tweak additional factors. For example, ensure that one disk has enough free space that the old scheduling behavior would assign two replicas to it instead of distributing the replicas evenly among the disks.
  2. In a cluster that includes nodes with multiple available disks, create a volume with spec.replicaSoftAntiAffinity = enabled, spec.replicaDiskSoftAntiAffinity = disabled, and numberOfReplicas equal to one more than the total number of disks in the cluster. Confirm that a replica fails to schedule. Previously, multiple replicas would have scheduled to the same disk and no error would have occurred. The relevant spec fields for both cases are sketched after this list.
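
The two test volumes differ only in the disk anti-affinity field and the replica count. The manifests below are sketches with hypothetical names, assuming a cluster with three schedulable disks; only the relevant spec fields are shown.

# Case 1: soft disk anti-affinity, replicas == disks; expect even spread.
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: test-case-1  # hypothetical
  namespace: longhorn-system
spec:
  numberOfReplicas: 3
  replicaSoftAntiAffinity: enabled
  replicaDiskSoftAntiAffinity: enabled
---
# Case 2: hard disk anti-affinity, replicas == disks + 1; expect one
# replica to fail scheduling.
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: test-case-2  # hypothetical
  namespace: longhorn-system
spec:
  numberOfReplicas: 4
  replicaSoftAntiAffinity: enabled
  replicaDiskSoftAntiAffinity: disabled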

Upgrade strategy

The Replica Disk Level Soft Anti-Affinity setting defaults to true to maintain backwards compatibility. If it is set to `false`, new replicas that require scheduling will follow the new behavior.

The spec.replicaDiskSoftAntiAffinity volume field defaults to ignored to maintain backwards compatibility. If it is set to enabled on a volume, new replicas for that volume that require scheduling will follow the new behavior.