From eb3e413c6ad57fd19e1e2af9f15217dd29c842a5 Mon Sep 17 00:00:00 2001
From: Eric Weber
Date: Mon, 31 Jul 2023 15:27:33 -0500
Subject: [PATCH] Add LEP for disk anti-affinity

Longhorn 3823

Signed-off-by: Eric Weber
---
 enhancements/20230718-disk-anti-affinity.md | 155 ++++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 enhancements/20230718-disk-anti-affinity.md

diff --git a/enhancements/20230718-disk-anti-affinity.md b/enhancements/20230718-disk-anti-affinity.md
new file mode 100644
index 0000000..19b2de7
--- /dev/null
+++ b/enhancements/20230718-disk-anti-affinity.md
@@ -0,0 +1,155 @@
+# Disk Anti-Affinity
+
+## Summary
+
+Longhorn supports multiple disks per node, but there is currently no way to ensure that two replicas for the same
+volume that schedule to the same node end up on different disks. In fact, the replica scheduler currently doesn't make
+any attempt to achieve this goal, even when it is possible to do so.
+
+With the addition of a Disk Anti-Affinity feature, the Longhorn replica scheduler will attempt to schedule two replicas
+for the same volume to different disks when possible. Optionally, the scheduler will refuse to schedule a replica to a
+disk that already has another replica for the same volume.
+
+Although the comparison is not perfect, this enhancement can be thought of as enabling RAID 1 for Longhorn (mirroring
+across multiple disks on the same node).
+
+See the [Motivation section](#motivation) for potential benefits.
+
+### Related Issues
+
+- https://github.com/longhorn/longhorn/issues/3823
+- https://github.com/longhorn/longhorn/issues/5149
+
+### Existing Related Features
+
+#### Replica Node Level Soft Anti-Affinity
+
+Disabled by default. When disabled, prevents the scheduling of a replica to a node with an existing healthy replica of
+the same volume.
+
+Can also be set at the volume level to override the global default.
+
+#### Replica Zone Level Soft Anti-Affinity
+
+Enabled by default. When disabled, prevents the scheduling of a replica to a zone with an existing healthy replica of
+the same volume.
+
+Can also be set at the volume level to override the global default.
+
+## Motivation
+
+- Large, multi-node clusters will likely not benefit from this enhancement.
+- Single-node clusters and small, multi-node clusters (on which the number of replicas per volume exceeds the number
+  of available nodes) will experience:
+  - Increased data durability. If a single disk fails, a healthy replica will still exist on a disk that has not
+    failed.
+  - Increased data availability. If a single disk on a node becomes unavailable, but the node itself remains
+    healthy, at least one replica remains healthy. On a single-node cluster, this can directly prevent a volume crash.
+    On a small, multi-node cluster, this can prevent a future volume crash due to the loss of a different node.
+
+### Goals
+
+- In all situations, the Longhorn replica scheduler will make a best effort to ensure two replicas for the same volume
+  do not schedule to the same disk.
+- Optionally, the scheduler will refuse to schedule a replica to a disk that has another replica of the same volume.
+
+## Proposal
+
+### User Stories
+
+#### Story 1
+
+My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to
+distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks
+as desired replicas, I want Longhorn to do the best it can.
+
+#### Story 2
+
+My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to
+distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks
+as desired replicas, I want scheduling to fail obviously. It is important that I know my volumes aren't being protected
+so I can take action.
+
+#### Story 3
+
+My cluster consists of a single node with multiple attached SSDs. When I create a specific, high-priority volume, I
+want replicas to distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many
+available disks as desired replicas, I want scheduling to fail obviously. It is important that I know my high-priority
+volume isn't being protected so I can take action.
+
+### User Experience In Detail
+
+### API changes
+
+Introduce a new Replica Disk Level Soft Anti-Affinity setting with the following definition. By default, set it to
+`true`. While it is generally desirable to schedule replicas to different disks, refusing to schedule replicas when
+different disks are not available would break existing behavior.
+
+```golang
+SettingDefinitionReplicaDiskSoftAntiAffinity = SettingDefinition{
+	DisplayName: "Replica Disk Level Soft Anti-Affinity",
+	Description: "Allow scheduling on disks with existing healthy replicas of the same volume",
+	Category:    SettingCategoryScheduling,
+	Type:        SettingTypeBool,
+	Required:    true,
+	ReadOnly:    false,
+	Default:     "true",
+}
+```
+
+Introduce a new `spec.replicaDiskSoftAntiAffinity` volume field. By default, set it to `ignored`. Similar to the
+existing `spec.replicaSoftAntiAffinity` and `spec.replicaZoneSoftAntiAffinity` fields, override the global setting if
+this field is set to `enabled` or `disabled`.
+
+```yaml
+replicaDiskSoftAntiAffinity:
+  description: Replica disk soft anti affinity of the volume. Set enabled
+    to allow replicas to be scheduled on the same disk.
+  enum:
+  - ignored
+  - enabled
+  - disabled
+  type: string
+```
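+
+As an illustration of the override semantics described above, the effective behavior could be resolved roughly as
+follows. This is only a sketch, not proposed Longhorn code; the package, constant, and function names are hypothetical.
+
+```golang
+package scheduler // hypothetical package name, for illustration only
+
+// Values of the proposed spec.replicaDiskSoftAntiAffinity volume field.
+const (
+	ReplicaDiskSoftAntiAffinityIgnored  = "ignored"
+	ReplicaDiskSoftAntiAffinityEnabled  = "enabled"
+	ReplicaDiskSoftAntiAffinityDisabled = "disabled"
+)
+
+// diskSoftAntiAffinityAllowed reports whether a replica may share a disk with another healthy replica of
+// the same volume. A volume-level value of "enabled" or "disabled" overrides the global setting; "ignored"
+// (or an unset field) falls back to the global Replica Disk Level Soft Anti-Affinity setting.
+func diskSoftAntiAffinityAllowed(volumeField string, globalSetting bool) bool {
+	switch volumeField {
+	case ReplicaDiskSoftAntiAffinityEnabled:
+		return true
+	case ReplicaDiskSoftAntiAffinityDisabled:
+		return false
+	default:
+		return globalSetting
+	}
+}
+```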
+
+## Design
+
+### Implementation Overview
+
+The current replica scheduler does the following:
+
+1. Determines which nodes a replica can be scheduled to based on node condition and the `ReplicaSoftAntiAffinity` and
+   `ReplicaZoneSoftAntiAffinity` settings.
+1. Creates a list of all schedulable disks on these nodes.
+1. Chooses the disk with the most available space for scheduling.
+
+Add a step so that the replica scheduler does the following (a sketch of the new filtering step appears after the
+list):
+
+1. Determines which nodes a replica can be scheduled to based on node condition and the `ReplicaSoftAntiAffinity` and
+   `ReplicaZoneSoftAntiAffinity` settings.
+1. Creates a list of all schedulable disks on these nodes.
+1. Filters the list to include only disks with the fewest existing matching replicas and, optionally, only disks with
+   no existing matching replicas.
+1. Chooses the disk from the filtered list with the most available space for scheduling.
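+
+A minimal sketch of the new filtering step follows. The `Disk` type and helper name here are simplified and
+hypothetical; the real scheduler operates on its own internal structures.
+
+```golang
+package scheduler // hypothetical package name, for illustration only
+
+// Disk is a simplified stand-in for a schedulable disk candidate.
+type Disk struct {
+	UUID             string
+	AvailableSpace   int64
+	MatchingReplicas int // healthy replicas of the volume being scheduled that already live on this disk
+}
+
+// filterDisksForAntiAffinity keeps only the disks with the fewest existing matching replicas. If no disk is
+// free of matching replicas and soft anti-affinity is disabled, it returns an empty result so that
+// scheduling fails visibly instead of stacking replicas on one disk.
+func filterDisksForAntiAffinity(disks []Disk, softAntiAffinity bool) []Disk {
+	if len(disks) == 0 {
+		return nil
+	}
+
+	fewest := disks[0].MatchingReplicas
+	for _, disk := range disks[1:] {
+		if disk.MatchingReplicas < fewest {
+			fewest = disk.MatchingReplicas
+		}
+	}
+
+	if fewest > 0 && !softAntiAffinity {
+		return nil
+	}
+
+	filtered := make([]Disk, 0, len(disks))
+	for _, disk := range disks {
+		if disk.MatchingReplicas == fewest {
+			filtered = append(filtered, disk)
+		}
+	}
+	return filtered
+}
+```
+
+The existing step that picks the disk with the most available space then runs unchanged over the filtered list.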
+
+### Test plan
+
+Minimally implement two new test cases:
+
+1. In a cluster that includes nodes with multiple available disks, create a volume with
+   `spec.replicaSoftAntiAffinity = enabled`, `spec.replicaDiskSoftAntiAffinity = enabled`, and `numberOfReplicas` equal
+   to the total number of disks in the cluster. Confirm that each replica schedules to a different disk. It may be
+   necessary to tweak additional factors. For example, ensure that one disk has enough free space that the old
+   scheduling behavior would assign two replicas to it instead of distributing the replicas evenly among the disks.
+1. In a cluster that includes nodes with multiple available disks, create a volume with
+   `spec.replicaSoftAntiAffinity = enabled`, `spec.replicaDiskSoftAntiAffinity = disabled`, and `numberOfReplicas`
+   equal to one more than the total number of disks in the cluster. Confirm that a replica fails to schedule.
+   Previously, multiple replicas would have scheduled to the same disk and no error would have occurred.
+
+### Upgrade strategy
+
+The Replica Disk Level Soft Anti-Affinity setting defaults to `true` to maintain backwards compatibility. If it is set
+to `false`, new replicas that require scheduling will follow the new behavior.
+
+The `spec.replicaDiskSoftAntiAffinity` volume field defaults to `ignored` to maintain backwards compatibility. If it is
+set to `enabled` on a volume, new replicas for that volume that require scheduling will follow the new behavior.