From 68f9e034b0aa5d74028ec661e46e715b46f8169b Mon Sep 17 00:00:00 2001
From: Phan Le
Date: Tue, 20 Oct 2020 22:30:42 -0700
Subject: [PATCH] add LEP for feature `handling volume remount by killing the consuming pod`

Longhorn #1719

Signed-off-by: Phan Le
---
 ...by-delete-and-recreate-the-workload-pod.md | 119 ++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md

diff --git a/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md b/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md
new file mode 100644
index 0000000..c6c1cda
--- /dev/null
+++ b/enhancements/20201020-recover-from-volume-failure-by-delete-and-recreate-the-workload-pod.md
@@ -0,0 +1,119 @@
+# Recover From Volume Failure by Deleting and Recreating the Workload Pod
+
+## Summary
+
+[The current implementation](https://github.com/longhorn/longhorn-manager/blob/ba4ca64ad03911194c8586932e9f529e19c884a4/util/util.go#L712) of the remount feature doesn't work when the workload pod uses a subpath.
+This enhancement proposes a new way to handle volume remounting: delete the workload pod if it is controlled by a controller
+(e.g. deployment, statefulset, daemonset).
+The pod's controller then recreates it, and Kubernetes detaches, reattaches, and remounts the volume.
+
+### Related Issues
+
+https://github.com/longhorn/longhorn/issues/1719
+
+## Motivation
+
+### Goals
+
+Make sure that when a volume is reattached after an unexpected detachment or after auto salvage,
+the workload pod can use the volume again, even if the containers inside the pod mount it with subpaths.
+
+## Solution Space
+
+### #1 Use `findmnt` to detect the mount point of the subpath
+
+The command `findmnt` can be used to find all existing mount points corresponding to the device.
+Example output of this command when a pod is using a subpath:
+```
+root@worker-1:~# findmnt /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9
+TARGET SOURCE FSTYPE OPTIONS
+/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 ext4 rw,rela
+/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0 /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9[/html] ext4 rw,rela
+```
+From this output we can identify which mount point is a subpath mount and which mount point mounts the root of the volume's filesystem.
+Longhorn could then remount only the mount point that mounts the root of the volume's filesystem.
+
+### #2 Delete the workload pod
+
+Instead of manually finding and remounting all mount points of the volume, we delete the pod if it has a controller.
+The pod's controller recreates it, and Kubernetes then handles the reattachment and remounting of the volume.
+
+This solves the issue that remounting doesn't work when the workload pod uses a subpath in the PVC.
+
+## Proposal
+
+I would like to choose approach #2, `Delete the workload pod`, because it is cleaner.
+Manually remounting is cumbersome and leaves duplicated mount points on the host.
+
+### User Stories
+
+#### Story 1
+
+Users use a subpath in a PVC that is bound to a Longhorn volume.
+When the network goes bad, the volume becomes faulty (if there is no local replica).
+Longhorn auto salvages the volume when the network comes back.
+Then Longhorn auto-remounts the volume, but [the current remount logic](https://github.com/longhorn/longhorn-manager/blob/ba4ca64ad03911194c8586932e9f529e19c884a4/util/util.go#L712) doesn't support subpaths.
+
+### User Experience In Detail
+
+When users deploy a workload using a controller such as a deployment, statefulset, or daemonset,
+they are assured that the volume gets reattached and remounted in case it is detached unexpectedly.
+
+What about a pod without a controller? Users have to manually delete and recreate it.
+
+### API changes
+
+There is no API change.
+
+## Design
+
+### Implementation Overview
+
+The idea is that `VolumeController` sets `vol.Status.RemountRequestedAt` when the volume needs to be remounted.
+The `KubernetesPodController` compares `vol.Status.RemountRequestedAt` with the pod's `startTime`.
+If the pod's `startTime` < `vol.Status.RemountRequestedAt`, `KubernetesPodController` deletes the pod.
+We don't delete the pod immediately, though:
+we wait until `timeNow` > `vol.Status.RemountRequestedAt` + `delayDuration` (5s).
+The `delayDuration` makes sure we don't repeatedly delete the pod when `vol.Status.RemountRequestedAt` is updated in quick succession by `VolumeController`.
+After `KubernetesPodController` deletes the pod, there is no need to pass any information back to `VolumeController`, because `RemountRequestedAt` only records an event in the past and does not need to be reset.
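+
+Below is a minimal sketch of this decision logic in Go. The function and variable names are simplified for illustration and are not the actual longhorn-manager code:
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// shouldDeletePod decides whether KubernetesPodController should delete the
+// workload pod so that its controller recreates it and the volume is remounted.
+// podStartTime is the pod's startTime, remountRequestedAt is
+// vol.Status.RemountRequestedAt set by VolumeController, and delayDuration is
+// the grace period (5s) that prevents deleting the pod repeatedly.
+func shouldDeletePod(now, podStartTime, remountRequestedAt time.Time, delayDuration time.Duration) bool {
+	// Only pods that started before the remount request need to be deleted.
+	if !podStartTime.Before(remountRequestedAt) {
+		return false
+	}
+	// Hold off until the remount request is at least delayDuration old.
+	return now.After(remountRequestedAt.Add(delayDuration))
+}
+
+func main() {
+	now := time.Now()
+	podStart := now.Add(-10 * time.Minute)          // pod started before the failure
+	remountRequestedAt := now.Add(-6 * time.Second) // remount requested 6s ago
+	fmt.Println(shouldDeletePod(now, podStart, remountRequestedAt, 5*time.Second)) // prints: true
+}
+```
+
+If the delay has not passed yet, the controller would simply re-evaluate the pod on a later sync instead of deleting it right away.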
+
+### Test plan
+
+1. Deploy a storage class with the parameters `numberOfReplicas: 1` and `dataLocality: best-effort`.
+1. Deploy a statefulset with `replicas: 1` that uses the above storage class.
+   Make sure the container in the pod template mounts the volume with a subpath, like this:
+   ```yaml
+   volumeMounts:
+   - name:
+     mountPath: /mnt/html
+     subPath: html
+   ```
+1. Find the node where the statefulset pod is running.
+   Let's say `pod-1` is on `node-1` and uses `vol-1`.
+1. Exec into `pod-1` and create a file `test_data.txt` inside the folder `/mnt/html`.
+1. Kill the replica instance manager pod on `node-1`.
+   This action simulates a network disconnection (the engine process of the PVC cannot talk to the replica process in the killed replica instance manager pod).
+1. In a 2-minute retry loop (a helper sketch for this check is included at the end of this document):
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+1. Kill the replica instance manager pod on `node-1` one more time.
+1. Wait for the volume to become healthy, then kill the replica instance manager pod on `node-1` one more time.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+1. Update `numberOfReplicas` to 3.
+   Wait for the replica rebuilding to finish.
+1. Kill the engine instance manager pod on `node-1`.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+1. Delete `pod-1`.
+1. In a 2-minute retry loop:
+   Exec into `pod-1` and run `ls /mnt/html`.
+   Verify that the file `test_data.txt` exists.
+
+### Upgrade strategy
+
+There is no upgrade needed.
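+
+The repeated "2-minute retry loop" steps in the test plan above could be automated with a small helper like the sketch below. It assumes `kubectl` is on the `PATH` and pointed at the test cluster, and that the workload pod is named `pod-1` as in the test plan:
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// waitForFile retries `kubectl exec <pod> -- ls <dir>` until the expected file
+// shows up or the timeout expires, mirroring the retry-loop steps of the test plan.
+func waitForFile(pod, dir, file string, timeout time.Duration) bool {
+	deadline := time.Now().Add(timeout)
+	for time.Now().Before(deadline) {
+		out, err := exec.Command("kubectl", "exec", pod, "--", "ls", dir).CombinedOutput()
+		if err == nil && strings.Contains(string(out), file) {
+			return true
+		}
+		// The pod may still be restarting or the volume may still be remounting; retry.
+		time.Sleep(5 * time.Second)
+	}
+	return false
+}
+
+func main() {
+	if !waitForFile("pod-1", "/mnt/html", "test_data.txt", 2*time.Minute) {
+		fmt.Println("FAIL: test_data.txt not visible within 2 minutes")
+		os.Exit(1)
+	}
+	fmt.Println("PASS: test_data.txt is accessible, the volume was remounted")
+}
+```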