Add LEP for disk anti-affinity

Longhorn 3823

Signed-off-by: Eric Weber <eric.weber@suse.com>

enhancements/20230718-disk-anti-affinity.md (new file, 155 lines)

# Disk Anti-Affinity

## Summary

Longhorn supports multiple disks per node, but there is currently no way to ensure that two replicas for the same
volume that schedule to the same node end up on different disks. In fact, the replica scheduler currently makes no
attempt to achieve this goal, even when it is possible to do so.

With the addition of a Disk Anti-Affinity feature, the Longhorn replica scheduler will attempt to schedule two replicas
for the same volume to different disks when possible. Optionally, the scheduler will refuse to schedule a replica to a
disk that has another replica for the same volume.

Although the comparison is not perfect, this enhancement can be thought of as enabling RAID 1 for Longhorn (mirroring
across multiple disks on the same node).

See the [Motivation section](#motivation) for potential benefits.

### Related Issues

- https://github.com/longhorn/longhorn/issues/3823
- https://github.com/longhorn/longhorn/issues/5149

### Existing Related Features

#### Replica Node Level Soft Anti-Affinity

Disabled by default. When disabled, prevents the scheduling of a replica to a node with an existing healthy replica of
the same volume.

Can also be set at the volume level to override the global default.

#### Replica Zone Level Soft Anti-Affinity

Enabled by default. When disabled, prevents the scheduling of a replica to a zone with an existing healthy replica of
the same volume.

Can also be set at the volume level to override the global default.

## Motivation

- Large, multi-node clusters will likely not benefit from this enhancement.
- Single-node clusters and small, multi-node clusters (on which the number of replicas per volume exceeds the number
  of available nodes) will experience:
  - Increased data durability. If a single disk fails, a healthy replica will still exist on a disk that has not
    failed.
  - Increased data availability. If a single disk on a node becomes unavailable, but the node itself remains healthy,
    at least one replica remains healthy. On a single-node cluster, this can directly prevent a volume crash. On a
    small, multi-node cluster, this can prevent a future volume crash due to the loss of a different node.

### Goals

- In all situations, the Longhorn replica scheduler will make a best effort to ensure two replicas for the same volume
  do not schedule to the same disk.
- Optionally, the scheduler will refuse to schedule a replica to a disk that has another replica of the same volume.

## Proposal

### User Stories

#### Story 1

My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to
distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks
as desired replicas, I want Longhorn to do the best it can.

#### Story 2

My cluster consists of a single node with multiple attached SSDs. When I create any new volume, I want replicas to
distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many available disks
as desired replicas, I want scheduling to fail obviously. It is important that I know my volumes aren't being protected
so I can take action.

#### Story 3

My cluster consists of a single node with multiple attached SSDs. When I create a specific, high-priority volume, I
want replicas to distribute across these disks so that I can recover from n - 1 disk failures. If there are not as many
available disks as desired replicas, I want scheduling to fail obviously. It is important that I know the high-priority
volume isn't being protected so I can take action.

### User Experience In Detail
### API changes

Introduce a new Replica Disk Level Soft Anti-Affinity setting with the following definition. By default, set it to
`true`. While it is generally desirable to schedule replicas to different disks, refusing to schedule replicas when
different disks are not available would break existing behavior.

```golang
SettingDefinitionReplicaDiskSoftAntiAffinity = SettingDefinition{
	DisplayName: "Replica Disk Level Soft Anti-Affinity",
	Description: "Allow scheduling on disks with existing healthy replicas of the same volume",
	Category:    SettingCategoryScheduling,
	Type:        SettingTypeBool,
	Required:    true,
	ReadOnly:    false,
	Default:     "true",
}
```

Introduce a new `spec.replicaDiskSoftAntiAffinity` volume field. By default, set it to `ignored`. Similar to the
existing `spec.replicaSoftAntiAffinity` and `spec.replicaZoneSoftAntiAffinity` fields, override the global setting if
this field is set to `enabled` or `disabled`.

```yaml
replicaDiskSoftAntiAffinity:
  description: Replica disk soft anti affinity of the volume. Set enabled
    to allow replicas to be scheduled in the same disk.
  enum:
  - ignored
  - enabled
  - disabled
  type: string
```
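
The field interacts with the global setting the same way the existing node and zone anti-affinity fields do: the volume
field takes precedence unless it is `ignored`. A minimal sketch of that resolution logic (the function name and
signature here are illustrative, not the actual Longhorn implementation):

```golang
// effectiveDiskSoftAntiAffinity is an illustrative helper, not Longhorn code.
// globalSetting is the value of the Replica Disk Level Soft Anti-Affinity setting;
// volumeField is the volume's spec.replicaDiskSoftAntiAffinity value.
func effectiveDiskSoftAntiAffinity(globalSetting bool, volumeField string) bool {
	switch volumeField {
	case "enabled":
		return true // this volume tolerates replicas sharing a disk
	case "disabled":
		return false // this volume requires replicas on different disks
	default:
		return globalSetting // "ignored" (or unset) falls back to the global setting
	}
}
```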

## Design

### Implementation Overview

The current replica scheduler does the following:

1. Determines which nodes a replica can be scheduled to based on node condition and the `ReplicaSoftAntiAffinity` and
   `ReplicaZoneSoftAntiAffinity` settings.
1. Creates a list of all schedulable disks on these nodes.
1. Chooses the disk with the most available space for scheduling.

Add a step so that the replica scheduler:

1. Determines which nodes a replica can be scheduled to based on node condition and the `ReplicaSoftAntiAffinity` and
   `ReplicaZoneSoftAntiAffinity` settings.
1. Creates a list of all schedulable disks on these nodes.
1. Filters the list to include only disks with the fewest existing matching replicas and, optionally, only disks with
   no existing matching replicas (see the sketch after this list).
1. Chooses the disk from the filtered list with the most available space for scheduling.

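A minimal sketch of the new filtering step, assuming `softAntiAffinity` is `true` for soft disk anti-affinity and
`false` for hard disk anti-affinity (the type and function names are simplified stand-ins, not the actual scheduler
code):

```golang
// diskCandidate is a simplified stand-in for the scheduler's per-disk bookkeeping.
type diskCandidate struct {
	DiskUUID         string
	AvailableSize    int64 // used by the later "most available space" step
	MatchingReplicas int   // replicas of the volume being scheduled that already live on this disk
}

// filterDisksByMatchingReplicas keeps only the disks tied for the fewest matching
// replicas. With hard anti-affinity (softAntiAffinity == false), it returns nothing
// unless at least one disk has no matching replicas, so scheduling fails visibly.
func filterDisksByMatchingReplicas(disks []diskCandidate, softAntiAffinity bool) []diskCandidate {
	if len(disks) == 0 {
		return nil
	}
	fewest := disks[0].MatchingReplicas
	for _, d := range disks[1:] {
		if d.MatchingReplicas < fewest {
			fewest = d.MatchingReplicas
		}
	}
	if !softAntiAffinity && fewest > 0 {
		return nil
	}
	var filtered []diskCandidate
	for _, d := range disks {
		if d.MatchingReplicas == fewest {
			filtered = append(filtered, d)
		}
	}
	return filtered
}
```
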
### Test plan

Minimally implement two new test cases:

1. In a cluster that includes nodes with multiple available disks, create a volume with
   `spec.replicaSoftAntiAffinity = enabled`, `spec.replicaDiskSoftAntiAffinity = enabled`, and `numberOfReplicas` equal
   to the total number of disks in the cluster. Confirm that each replica schedules to a different disk. It may be
   necessary to tweak additional factors. For example, ensure that one disk has enough free space that the old
   scheduling behavior would assign two replicas to it instead of distributing the replicas evenly among the disks.
1. In a cluster that includes nodes with multiple available disks, create a volume with
   `spec.replicaSoftAntiAffinity = enabled`, `spec.replicaDiskSoftAntiAffinity = disabled`, and `numberOfReplicas`
   equal to one more than the total number of disks in the cluster. Confirm that a replica fails to schedule.
   Previously, multiple replicas would have scheduled to the same disk and no error would have occurred.

### Upgrade strategy

The Replica Disk Level Soft Anti-Affinity setting defaults to `true` to maintain backwards compatibility. If it is set
to `false`, new replicas that require scheduling will follow the new behavior.

|
||||
The `spec.replicaDiskSoftAntiAffinity` volume field defaults to `ignored` to maintain backwards compatibility. If it is
|