Improve Node Failure Handling By Automatically Force Delete Terminating Pods of StatefulSet/Deployment On Down Node
Summary
Kubernetes never force deletes pods of a StatefulSet or Deployment on a down node. Because the pod on the down node is not removed, the Longhorn volume remains attached to the down node as well. The replacement pod cannot start because the Longhorn volume is RWO (see the Kubernetes documentation on access modes), which can only be attached to one node at a time. We provide an option that lets Longhorn automatically force delete terminating pods of a StatefulSet/Deployment on the down node. After force deleting, Kubernetes will detach the Longhorn volume and spin up replacement pods on a new node.
Related Issues
https://github.com/longhorn/longhorn/issues/1105
Motivation
Goals
The goal is to monitor node status on behalf of users and automatically force delete terminating pods on down nodes. Without this feature, users would have to manually force delete the pods so that new replacement pods can be started.
Proposal
Implemented a mechanism to force delete pods of a Deployment/StatefulSet on a down node. There are 4 options for `NodeDownPodDeletionPolicy`:
- `DoNothing`
- `DeleteStatefulSetPod`
- `DeleteDeploymentPod`
- `DeleteBothStatefulsetAndDeploymentPod`
When the setting is enabled, Longhorn will monitor node status and force delete pods on the down node on behalf of users.
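For illustration only, the policy can be modeled as a string-valued setting. The Go sketch below is an assumption, not the actual Longhorn code: the type and constant identifiers are hypothetical, while the string values correspond to the kebab-case setting values used in the test plan later in this document.

```go
package settings

// NodeDownPodDeletionPolicy models the setting as a string value.
// The identifiers below are illustrative; the actual names in the Longhorn
// codebase may differ. The string values match the setting values referenced
// in the test plan.
type NodeDownPodDeletionPolicy string

const (
	DoNothing                             NodeDownPodDeletionPolicy = "do-nothing"
	DeleteStatefulSetPod                  NodeDownPodDeletionPolicy = "delete-statefulset-pod"
	DeleteDeploymentPod                   NodeDownPodDeletionPolicy = "delete-deployment-pod"
	DeleteBothStatefulsetAndDeploymentPod NodeDownPodDeletionPolicy = "delete-both-statefulset-and-deployment-pod"
)
```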
User Stories
Before this feature is implemented, the users would have to manually monitor and force delete pods when a node is down so that the Longhorn volume can be detached and a new replacement pod can start.
This process should be automated. After this feature is implemented, the users can have the option to allow Longhorn to monitor and force delete the pods on their behalf.
User Experience In Detail
To use this enhancement, users need to change the Longhorn setting `NodeDownPodDeletionPolicy`. The default value is `DoNothing`, which means Longhorn will not force delete any pods on a down node.
As a side note, even when `NodeDownPodDeletionPolicy` is set to `do-nothing`, the automatic VolumeAttachment removal still works, so Deployment pods are fine if users enable automatic VolumeAttachment removal.
API changes
No API changes.
Design
We created a new controller, the Kubernetes POD Controller, to watch pod and node status and handle the force deletion. A pod is force deleted when all of the conditions below are met (see the Go sketch after this list):
- The `NodeDownPodDeletionPolicy` and the pod's owner kind match the following table:

  Policy \ Kind | StatefulSet | ReplicaSet | Other
  ---|---|---|---
  `DoNothing` | Don't delete | Don't delete | Don't delete
  `DeleteStatefulSetPod` | Force delete | Don't delete | Don't delete
  `DeleteDeploymentPod` | Don't delete | Force delete | Don't delete
  `DeleteBothStatefulsetAndDeploymentPod` | Force delete | Force delete | Don't delete

- The node containing the pod is down, as determined by `IsNodeDownOrDeleted`. The function `IsNodeDownOrDeleted` checks whether the node status is `NotReady`.
- The pod is terminating (which means the pod has `deletionTimestamp` set) and the `deletionTimestamp` has passed.
- The pod has a PV with provisioner `driver.longhorn.io`.
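To make the decision table concrete, here is a minimal Go sketch of the force-deletion check. It is not the actual Longhorn controller code: the helper names (`shouldForceDelete`, `ownerKind`, `usesLonghornPV`) are hypothetical, and in practice the controller must also resolve the pod's PV to verify the `driver.longhorn.io` provisioner.

```go
package nodedown

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isNodeDownOrDeleted mirrors the check described above: the node is
// considered down when it is gone or its Ready condition is not True.
func isNodeDownOrDeleted(node *corev1.Node) bool {
	if node == nil {
		return true
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status != corev1.ConditionTrue
		}
	}
	return true
}

// ownerKind returns the kind of the pod's controlling owner
// ("StatefulSet", "ReplicaSet", ...) or "" if there is none.
// Deployment pods are owned by a ReplicaSet, which is why the decision
// table above uses ReplicaSet for Deployment pods.
func ownerKind(pod *corev1.Pod) string {
	if ref := metav1.GetControllerOf(pod); ref != nil {
		return ref.Kind
	}
	return ""
}

// shouldForceDelete applies the four conditions from the design. The
// usesLonghornPV flag stands in for the "pod has a PV provisioned by
// driver.longhorn.io" check, which requires a PV lookup in practice.
func shouldForceDelete(policy string, pod *corev1.Pod, node *corev1.Node, usesLonghornPV bool, now metav1.Time) bool {
	kindAllowed := false
	switch ownerKind(pod) {
	case "StatefulSet":
		kindAllowed = policy == "delete-statefulset-pod" ||
			policy == "delete-both-statefulset-and-deployment-pod"
	case "ReplicaSet":
		kindAllowed = policy == "delete-deployment-pod" ||
			policy == "delete-both-statefulset-and-deployment-pod"
	}
	// Terminating means deletionTimestamp is set; it must also have passed.
	terminating := pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&now)
	return kindAllowed && isNodeDownOrDeleted(node) && terminating && usesLonghornPV
}
```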
Implementation Overview
Same as the Design section above.
Test plan
- Set up a cluster of 3 nodes.
- Install Longhorn and set `Default Replica Count = 2` (because we will turn off one node).
- Create a StatefulSet with 2 pods using the command:
  `kubectl create -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml`
- Create a volume + PV + PVC named `vol1` and create a deployment of default ubuntu named `shell` with the PVC `vol1` mounted under `/mnt/vol1`.
- Find the node which contains one pod of the StatefulSet/Deployment. Power off the node.
StatefulSet
If `NodeDownPodDeletionPolicy` is set to `do-nothing` | `delete-deployment-pod`:
- Wait till the `pod.deletionTimestamp` has passed.
- Verify that no replacement pod is generated and the pod is stuck in terminating state forever.

If `NodeDownPodDeletionPolicy` is set to `delete-statefulset-pod` | `delete-both-statefulset-and-deployment-pod`:
- Wait till the pod's status becomes `terminating` and the `pod.deletionTimestamp` has passed (around 7 minutes).
- Verify that the pod is deleted and there is a new running replacement pod.
- Verify that you can access/read/write the volume on the new pod.
Deployment
If `NodeDownPodDeletionPolicy` is set to `do-nothing` | `delete-statefulset-pod` AND `Volume Attachment Recovery Policy` is `never`:
- Wait till the `pod.deletionTimestamp` has passed.
- The replacement pod will be stuck in `Pending` state forever.
- Force delete the terminating pod.
- Wait till the replacement pod is running.
- Verify that you can access `vol1` via the `shell` replacement pod under `/mnt/vol1` once it is in the running state.
If `NodeDownPodDeletionPolicy` is set to `do-nothing` | `delete-statefulset-pod` AND `Volume Attachment Recovery Policy` is `wait`:
- Wait till the replacement pod is generated (default is around 6 minutes, a Kubernetes setting).
- Wait till the `pod.deletionTimestamp` has passed.
- Verify that you can access `vol1` via the `shell` replacement pod under `/mnt/vol1` once it is in the running state.
- Verify that the original `shell` pod is stuck in `Pending` state forever.
If `NodeDownPodDeletionPolicy` is set to `do-nothing` | `delete-statefulset-pod` AND `Volume Attachment Recovery Policy` is `immediate`:
- Wait till the replacement pod is generated (default is around 6 minutes, a Kubernetes setting).
- Verify that you can access `vol1` via the `shell` replacement pod under `/mnt/vol1` once it is in the running state.
- Verify that the original `shell` pod is stuck in `Pending` state forever.
If `NodeDownPodDeletionPolicy` is set to `delete-deployment-pod` | `delete-both-statefulset-and-deployment-pod` AND `Volume Attachment Recovery Policy` is `never` | `wait` | `immediate`:
- Wait till the `pod.deletionTimestamp` has passed.
- Verify that the pod is deleted and there is a new running replacement pod.
- Verify that you can access `vol1` via the `shell` replacement pod under `/mnt/vol1`.
Other kinds
- Verify that Longhorn never deletes any other pod on the down node.
Test example
One typical scenario in which the enhancement has succeeded is as below. When a node (say `node-x`) goes down (assuming Kubernetes' default settings and that the user allows Longhorn to force delete pods):

Time | Event |
---|---|
0m:00s | `node-x` goes down and stops sending heartbeats to the Kubernetes Node controller |
0m:40s | Kubernetes Node controller reports `node-x` is `NotReady` |
5m:40s | Kubernetes Node controller starts evicting pods from `node-x` using graceful termination (sets `DeletionTimestamp` and `deletionGracePeriodSeconds = 10s/30s`) |
5m:50s/6m:10s | Longhorn force deletes the pod of the StatefulSet/Deployment which uses a Longhorn volume |
Upgrade strategy
Doesn't impact upgrade.