# Support Kubernetes Cluster Autoscaler

Longhorn should support the Kubernetes Cluster Autoscaler (CA).

## Summary

Currently, Longhorn pods are [blocking CA from removing a node](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node). This enhancement proposes a new global setting `kubernetes-cluster-autoscaler-enabled` that annotates Longhorn components and adds logic for instance-manager PodDisruptionBudget (PDB) management.

### Related Issues

https://github.com/longhorn/longhorn/issues/2203

## Motivation
### Goals

- Longhorn should block CA from scaling down a node that meets ANY of the following conditions:
  - Any volume is attached to the node.
  - The node contains a backing image manager pod.
  - The node contains a share manager pod.
- Longhorn should not block CA from scaling down a node that meets ALL of the following conditions:
  - All volumes are detached, and another schedulable node holds a volume replica and a replica IM PDB.
  - The node does not contain a backing image manager pod.
  - The node does not contain a share manager pod.

### Non-goals [optional]

- CA setup.
- CA blocked by kube-system components.
- CA blocked by backing image manager pods. (TODO)
- CA blocked by share manager pods. (TODO)

## Proposal

Setting `kubernetes-cluster-autoscaler-enabled` to `true` adds the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation to Longhorn pods that are not backed by a controller or that have local storage volume mounts. To avoid data loss, Longhorn does not annotate the backing image manager and share manager pods.

Currently, Longhorn creates instance-manager (IM) PDBs for replicas/engines regardless of the volume state. During scale-down, CA tries to find a removable node but is blocked by those instance-manager PDBs.

We can add IM PDB handling so that a PDB is created and retained only when it is required (a sketch of such a PDB follows the list):

- There are volumes/engines running on the node. We need to guarantee that the volumes won't crash.
- The only available/valid replica of a volume is on the node. Here we need to prevent the volume data from being lost.

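To make the retention behavior concrete, below is a minimal sketch of an instance-manager PDB; the name and selector label are assumptions for illustration, not the exact manifest Longhorn generates.

```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: instance-manager-r-xxxxxxxx            # illustrative; follows the instance-manager name
  namespace: longhorn-system
spec:
  minAvailable: 1                              # prevents eviction of the instance-manager pod
  selector:
    matchLabels:
      longhorn.io/instance-manager-name: instance-manager-r-xxxxxxxx   # assumed selector label
```

While such a PDB exists, CA cannot drain the node; this is exactly why Longhorn must delete the PDB once it is no longer required.
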
### User Stories
#### CA scaling

Before this enhancement, CA is blocked by:

- Pods that are not backed by a controller (engine/replica instance managers).
- Pods with [local storage volume mounts](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/drain/drain.go#L222) (longhorn-ui, longhorn-csi-plugin, csi-attacher, csi-provisioner, csi-resizer, csi-snapshotter).

After this enhancement, instance-manager PDBs will be actively managed by Longhorn:

- Create engine/replica instance-manager PDBs when the volume is attached.
- Delete the engine instance-manager PDB when the volume is detached.
- Delete replica instance-manager PDBs when the volume is detached, but keep one so the last healthy replica stays protected.

The user can set the new global setting `kubernetes-cluster-autoscaler-enabled` to unblock CA scaling. This allows Longhorn to annotate Longhorn-managed deployments and engine/replica instance-manager pods with `cluster-autoscaler.kubernetes.io/safe-to-evict`.

### User Experience In Detail

- Configure the setting via the Longhorn UI or kubectl.
- Ensure every volume's replica count is set to more than 1.
- CA is not blocked by Longhorn components when the node contains no volume replica, backing image manager pod, or share manager pod.
- The engine/replica instance-manager PDBs block node removal while the volume is attached.
- The replica instance-manager PDB blocks node removal when CA tries to delete the last node holding a volume replica.

### API changes

`None`

## Design
### Implementation Overview
#### Global setting

- Add a new global setting `Kubernetes Cluster Autoscaler Enabled (Experimental)`.
- The setting is `boolean`.
- The default value is `false`.

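As a sketch, the setting could be enabled through its custom resource as below, assuming the Longhorn `Setting` CRD in the `longhorn-system` namespace (the API version may differ between Longhorn releases):

```
apiVersion: longhorn.io/v1beta1
kind: Setting
metadata:
  name: kubernetes-cluster-autoscaler-enabled
  namespace: longhorn-system
value: "true"    # boolean setting; defaults to "false"
```

The same value can also be toggled from the Longhorn UI settings page.
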
#### Annotations

When the setting `kubernetes-cluster-autoscaler-enabled` is `true`, Longhorn will add the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` to the following pods (a sketch follows the list):

- Engine and replica instance-manager pods, because they are not backed by a controller and use local storage mounts.
- Deployment workloads that are managed by the Longhorn manager and use any local storage mount. The managed components are labeled with `longhorn.io/managed-by: longhorn-manager`.

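A minimal sketch of the resulting pod metadata (the pod name is illustrative):

```
apiVersion: v1
kind: Pod
metadata:
  name: instance-manager-e-xxxxxxxx    # illustrative name
  namespace: longhorn-system
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"   # tells CA this pod may be evicted
```
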
#### PodDisruptionBudget

- No change to the existing logic that cleans up the PDB when the instance manager doesn't exist.

- Engine IM PDB:
  - Delete the PDB when volumes are detached:
    - There is no instance process in the IM (im.Status.Instance).
    - The same logic applies when a node is unschedulable. A node is unschedulable when marked in its spec or when CA taints it with `ToBeDeletedByClusterAutoscaler` (see the taint sketch after this list).
  - Create the PDB when volumes are attached and there are instance processes in the IM (im.Status.Instance).

- Replica IM PDB:
  - Delete the PDB when the setting `allow-node-drain-with-last-healthy-replica` is enabled.
  - Delete the PDB when volumes are detached:
    - There is no instance process in the IM (im.Status.Instance).
    - There are other schedulable nodes that hold a healthy volume replica and have a replica IM PDB.
  - Delete the PDB when a node is unschedulable, i.e., marked in its spec or tainted by CA with `ToBeDeletedByClusterAutoscaler`:
    - Check whether the condition to delete the PDB is met (the same check as when volumes are detached).
    - Enqueue the replica instance manager of another schedulable node holding the volume replica.
    - Delete the PDB.
  - Create the PDB when volumes are attached:
    - There are instance processes in the IM (im.Status.Instance).
  - Create the PDB when volumes are detached:
    - There is no instance process in the IM (im.Status.Instance).
    - The replica has been started, and there are no other schedulable nodes that hold a healthy volume replica and have a replica IM PDB.

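For reference, the CA scale-down taint mentioned above appears in the node spec roughly as sketched below (the value is a timestamp set by CA; an illustrative value is shown):

```
spec:
  taints:
  - key: ToBeDeletedByClusterAutoscaler
    value: "1650000000"    # illustrative; CA records a deletion timestamp here
    effect: NoSchedule
```
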
### Test plan
#### Scenario: test CA

Given Cluster with Kubernetes cluster-autoscaler.
And Longhorn installed.
And Set `kubernetes-cluster-autoscaler-enabled` to `true`.
And Create deployment with CPU requests.
```
resources:
  limits:
    cpu: 300m
    memory: 30Mi
  requests:
    cpu: 150m
    memory: 15Mi
```
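
(The `resources` stanza above sits under `spec.template.spec.containers[]` in the test deployment manifest.)
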

When Trigger CA to scale up by increasing deployment replicas
(double the node number, not including the host node):
```
10 * math.ceil(allocatable_millicpu/cpu_request*node_number/10)
```
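
For example, with illustrative numbers allocatable_millicpu=2000, cpu_request=150, and node_number=3, this gives 10 * math.ceil(2000/150 * 3/10) = 10 * math.ceil(4.0) = 40 deployment replicas.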

Then Cluster should have double the node number.

When Trigger CA to scale down by decreasing deployment replicas
(back to the original node number).
Then Cluster should have the original node number.

#### Scenario: test CA scale down all nodes containing volume replicas

Given Cluster with Kubernetes cluster-autoscaler.
And Longhorn installed.
And Set `kubernetes-cluster-autoscaler-enabled` to `true`.
And Create volume.
And Attach the volume.
And Write some data to the volume.
And Detach the volume.
And Create deployment with CPU requests.

When Trigger CA to scale up by increasing deployment replicas
(double the node number, not including the host node).
Then Cluster should have double the node number.

When Annotate new nodes with `cluster-autoscaler.kubernetes.io/scale-down-disabled`
(this ensures scale-down happens only on the old nodes).
And Trigger CA to scale down by decreasing deployment replicas
(back to the original node number).
Then Cluster should have the original node number + 1 blocked node.

When Attach the volume to a new node. This triggers replica rebuilding.
And Volume data should be the same.
And Detach the volume.
Then Cluster should have the original node number.
And Volume data should be the same.

#### Scenario: test CA should block scale down of node running backing image manager pod

Similar to `Scenario: test CA scale down all nodes containing volume replicas`.

### Upgrade strategy

`N/A`

## Note [optional]

`N/A`