From 68f9e034b0aa5d74028ec661e46e715b46f8169b Mon Sep 17 00:00:00 2001
From: Phan Le
Date: Tue, 20 Oct 2020 22:30:42 -0700
Subject: [PATCH] add LEP for feature `handling volume remount by killing the consuming pod`

Longhorn #1719

Signed-off-by: Phan Le
---
 ...by-delete-and-recreate-the-workload-pod.md | 119 ++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md

diff --git a/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md b/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md
new file mode 100644
index 0000000..c6c1cda
--- /dev/null
+++ b/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md
@@ -0,0 +1,119 @@
+# Recover From Volume Failure by Deleting and Recreating the Workload Pod
+
+## Summary
+
+[The current implementation](https://github.com/longhorn/longhorn-manager/blob/ba4ca64ad03911194c8586932e9f529e19c884a4/util/util.go#L712) of the remount feature doesn't work when the workload pod uses a subpath.
+This enhancement proposes a new way to handle volume remounting: delete the workload pod if it is controlled by a controller
+(e.g. deployment, statefulset, daemonset).
+The pod's controller then recreates it, and Kubernetes detaches, reattaches, and remounts the volume.
+
+### Related Issues
+
+https://github.com/longhorn/longhorn/issues/1719
+
+## Motivation
+
+### Goals
+
+Make sure that when a volume is reattached after an unexpected detachment or after auto salvage,
+the workload pod can use the volume again, even if the containers inside the pod mount it with subpaths.
+
+## Solution Space
+
+### #1 Use `findmnt` to detect the mount point of the subpath
+
+The command `findmnt` can be used to find all existing mount points corresponding to the device.
+Example output of this command when a pod is using a subpath:
+```
+root@worker-1:~# findmnt /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9
+TARGET SOURCE FSTYPE OPTIONS
+/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4 rw,rela
+/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0 /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9[/html] ext4 rw,rela
+```
+From this output we can identify which mount point is a subpath mount and which mount point mounts the root of the volume's filesystem.
+Longhorn could then remount only the mount point that mounts the root of the volume's filesystem.
+
+### #2 Delete the workload pod
+
+Instead of manually finding and remounting all mount points of the volume, we delete the pod if it has a controller.
+The pod's controller recreates it, and Kubernetes then handles the reattachment and remounting of the volume.
+
+This solves the issue that remounting doesn't work when the workload pod uses a subpath in the PVC.
+
+## Proposal
+
+I would like to choose approach #2, `Delete the workload pod`, because it is cleaner.
+Manually remounting is cumbersome and leaves duplicated mount points on the host.
+
+### User Stories
+
+#### Story 1
+
+Users use a subpath in a PVC that is bound to a Longhorn volume.
+When the network goes bad, the volume becomes faulty (if there is no local replica).
+Longhorn auto salvages the volume when the network comes back.
+Then Longhorn auto-remounts the volume, but [the current remount logic](https://github.com/longhorn/longhorn-manager/blob/ba4ca64ad03911194c8586932e9f529e19c884a4/util/util.go#L712) doesn't support subpaths.
+
+### User Experience In Detail
+
+When users deploy a workload using a controller such as a deployment, statefulset, or daemonset,
+they are assured that the volume gets reattached and remounted in case it is detached unexpectedly.
+
+What about a pod without a controller? Users have to manually delete and recreate it.
+
+### API changes
+
+There is no API change.
+
+## Design
+
+### Implementation Overview
+
+The idea is that `VolumeController` sets `vol.Status.RemountRequestedAt` when the volume needs to be remounted.
+The `KubernetesPodController` compares `vol.Status.RemountRequestedAt` with the pod's `startTime`.
+If the pod's `startTime` < `vol.Status.RemountRequestedAt`, `KubernetesPodController` deletes the pod.
+We don't delete the pod immediately, though:
+we wait until `timeNow` > `vol.Status.RemountRequestedAt` + `delayDuration` (5s).
+The `delayDuration` makes sure we don't repeatedly delete the pod when `vol.Status.RemountRequestedAt` is updated in quick succession by `VolumeController`.
+After `KubernetesPodController` deletes the pod, there is no need to pass any information back to `VolumeController`, because `RemountRequestedAt` only records an event in the past and does not need to be reset.
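+
+Below is a minimal sketch of this decision logic in Go. The function and variable names are simplified for illustration and are not the actual longhorn-manager code:
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// shouldDeletePod decides whether KubernetesPodController should delete the
+// workload pod so that its controller recreates it and the volume is remounted.
+// podStartTime is the pod's startTime, remountRequestedAt is
+// vol.Status.RemountRequestedAt set by VolumeController, and delayDuration is
+// the grace period (5s) that prevents deleting the pod repeatedly.
+func shouldDeletePod(now, podStartTime, remountRequestedAt time.Time, delayDuration time.Duration) bool {
+	// Only pods that started before the remount request need to be deleted.
+	if !podStartTime.Before(remountRequestedAt) {
+		return false
+	}
+	// Hold off until the remount request is at least delayDuration old.
+	return now.After(remountRequestedAt.Add(delayDuration))
+}
+
+func main() {
+	now := time.Now()
+	podStart := now.Add(-10 * time.Minute)          // pod started before the failure
+	remountRequestedAt := now.Add(-6 * time.Second) // remount requested 6s ago
+	fmt.Println(shouldDeletePod(now, podStart, remountRequestedAt, 5*time.Second)) // prints: true
+}
+```
+
+If the delay has not passed yet, the controller would simply re-evaluate the pod on a later sync instead of deleting it right away.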
+
+### Test plan
+
+1. Deploy a storage class with the parameters `numberOfReplicas: 1` and `dataLocality: best-effort`.
+1. Deploy a statefulset with `replicas: 1` that uses the above storage class.
+   Make sure the container in the pod template mounts the volume with a subpath, like this:
+   ```yaml
+   volumeMounts:
+   - name:
+     mountPath: /mnt/html
+     subPath: html
+   ```
+1. Find the node where the statefulset pod is running.
+   Let's say `pod-1` is on `node-1` and uses `vol-1`.
+1. Exec into `pod-1` and create a file `test_data.txt` inside the folder `/mnt/html`.
+1. Kill the replica instance manager pod on `node-1`.
+   This action simulates a network disconnection (the engine process of the PVC cannot talk to the replica process in the killed replica instance manager pod).
+1. In a 2-minute retry loop (a helper sketch for this check is included at the end of this document):
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+1. Kill the replica instance manager pod on `node-1` one more time.
+1. Wait for the volume to become healthy, then kill the replica instance manager pod on `node-1` one more time.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+1. Update `numberOfReplicas` to 3.
+   Wait for the replica rebuilding to finish.
+1. Kill the engine instance manager pod on `node-1`.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+1. Delete `pod-1`.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+### Upgrade strategy
+
+There is no upgrade needed.
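+
+The repeated "2-minute retry loop" steps in the test plan above could be automated with a small helper like the sketch below. It assumes `kubectl` is on the `PATH` and pointed at the test cluster, and that the workload pod is named `pod-1` as in the test plan:
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// waitForFile retries `kubectl exec <pod> -- ls <dir>` until the expected file
+// shows up or the timeout expires, mirroring the retry-loop steps of the test plan.
+func waitForFile(pod, dir, file string, timeout time.Duration) bool {
+	deadline := time.Now().Add(timeout)
+	for time.Now().Before(deadline) {
+		out, err := exec.Command("kubectl", "exec", pod, "--", "ls", dir).CombinedOutput()
+		if err == nil && strings.Contains(string(out), file) {
+			return true
+		}
+		// The pod may still be restarting or the volume may still be remounting; retry.
+		time.Sleep(5 * time.Second)
+	}
+	return false
+}
+
+func main() {
+	if !waitForFile("pod-1", "/mnt/html", "test_data.txt", 2*time.Minute) {
+		fmt.Println("FAIL: test_data.txt not visible within 2 minutes")
+		os.Exit(1)
+	}
+	fmt.Println("PASS: test_data.txt is accessible, the volume was remounted")
+}
+```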