
Recover From Volume Failure by Delete and Recreate The Workload Pod

Summary

The current implementation of the remount feature doesn't work when the workload pod uses a subpath. This enhancement proposes a new way to handle volume remount: delete the workload pod if it is controlled by a controller (e.g. Deployment, StatefulSet, DaemonSet). By doing so, Kubernetes will restart the pod and detach, reattach, and remount the volume.

https://github.com/longhorn/longhorn/issues/1719

Motivation

Goals

Make sure that when the volume is reattached after it is detached unexpectedly or after it is auto salvaged, the workload pod can use the volume even if the containers inside the pod use subpaths.

Solution Space

#1 Use findmnt to detect the mount point of subpath

The command findmnt <DEV-NAME> can be used to find all existing mount points corresponding to the device. Example output of this command when a pod is using a subpath:

root@worker-1:~# findmnt /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9 
TARGET                                                                                                                                  SOURCE                                                        FSTYPE OPTIONS
/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volumes/kubernetes.io~csi/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/mount     /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9        ext4   rw,rela
/var/lib/kubelet/pods/0429305b-faf1-4668-aaed-139a2cf4989c/volume-subpaths/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9/nginx/0             /dev/longhorn/pvc-1ce69c7e-90ce-41ce-a88e-8b968d9a8ff9[/html] ext4   rw,rela

From this output, we can identify which mount point is a subpath mount (findmnt reports its SOURCE with a bracketed path suffix, e.g. [/html]) and which mount point mounts the root of the volume's filesystem. Longhorn can then remount the mount point that targets the root of the volume's filesystem.

#2 Delete the workload pod

Instead of manually finding and remounting all mount points of the volume, we delete the workload pod if it is owned by a controller. The pod's controller will recreate it, and Kubernetes then handles reattaching and remounting the volume.

This solves the issue that remount doesn't work when the workload pod uses a subpath of the PVC.

Proposal

I would like to choose approach #2, deleting the workload pod, because it is cleaner. Manually remounting is cumbersome and leaves duplicated mount points on the host.

User Stories

Story 1

Users use a subpath of a PVC that is bound to a Longhorn volume. When the network goes bad, the volume becomes faulty (if there is no local replica). Longhorn auto salvages the volume when the network comes back and then tries to auto-remount it, but the current remount logic doesn't support subpaths.

User Experience In Detail

When users deploy a workload using a controller such as a Deployment, StatefulSet, or DaemonSet, they are assured that the volume gets reattached and remounted in case it is detached unexpectedly.

What about a pod without a controller? Users have to manually delete and recreate it.

API changes

There is no API change.

Design

Implementation Overview

The idea is that the VolumeController sets vol.Status.RemountRequestedAt when the volume needs to be remounted. The KubernetesPodController compares RemountRequestedAt with the pod's startTime. If the pod's startTime < vol.Status.RemountRequestedAt, the KubernetesPodController deletes the pod. We don't delete the pod immediately though: we wait until timeNow > vol.Status.RemountRequestedAt + delayDuration (5s). The delayDuration makes sure we don't repeatedly delete the pod too fast when vol.Status.RemountRequestedAt is updated in quick succession by the VolumeController. After the KubernetesPodController deletes the pod, there is no need to pass any information back to the VolumeController, because there is no need to reset RemountRequestedAt; it simply records an event in the past.
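
A minimal Go sketch of that check is below. It assumes RemountRequestedAt is stored as an RFC3339 timestamp string on the volume status; the volumeStatus type, the function name, and the constant are illustrative stand-ins rather than the actual Longhorn code.

package kubernetespod

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// volumeStatus stands in for the part of Longhorn's Volume.Status that this
// check needs; only RemountRequestedAt matters here.
type volumeStatus struct {
	RemountRequestedAt string // RFC3339 timestamp written by the VolumeController
}

// remountRequestDelayDuration gives the VolumeController time to stop updating
// RemountRequestedAt before the workload pod is deleted.
const remountRequestDelayDuration = 5 * time.Second

// shouldDeletePodForRemount reports whether the workload pod should be deleted
// so that its controller recreates it and Kubernetes reattaches and remounts
// the volume: the pod must have started before the remount was requested, and
// the request must be at least remountRequestDelayDuration old.
func shouldDeletePodForRemount(pod *corev1.Pod, status volumeStatus, now time.Time) bool {
	if pod.Status.StartTime == nil || status.RemountRequestedAt == "" {
		return false
	}
	requestedAt, err := time.Parse(time.RFC3339, status.RemountRequestedAt)
	if err != nil {
		return false
	}
	startedBeforeRequest := pod.Status.StartTime.Time.Before(requestedAt)
	requestSettled := now.After(requestedAt.Add(remountRequestDelayDuration))
	return startedBeforeRequest && requestSettled
}

The KubernetesPodController would apply this only to pods owned by a controller (Deployment, StatefulSet, DaemonSet), since the owner is what recreates the deleted pod.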

Test plan

  1. Deploy a StorageClass with the parameters numberOfReplicas: 1 and dataLocality: best-effort (a manifest sketch is given after this list).

  2. Deploy a StatefulSet with replicas: 1 that uses the above StorageClass. Make sure the container in the pod template uses a subpath, like this:

    volumeMounts:
    - name: <PVC-NAME>
      mountPath: /mnt
      subPath: html
    
  3. Find the node where the StatefulSet pod is running. Let's say pod-1 is running on node-1 and uses vol-1.

  4. Exec into pod-1 and create a file test_data.txt inside the folder /mnt/html.

  5. Kill the replica instance manager pod on node-1. This action simulates a network disconnection (the engine process of the PVC cannot talk to the replica process on the killed replica instance manager pod).

  6. In a 2-minute retry loop: exec into pod-1, run ls /mnt/html, and verify that the file test_data.txt exists.

  7. Kill the replica instance manager pod on node-1 one more time.

  8. Wait for the volume to become healthy, then kill the replica instance manager pod on node-1 one more time.

  9. In a 2-minute retry loop: exec into pod-1, run ls /mnt/html, and verify that the file test_data.txt exists.

  10. Update numberOfReplicas to 3. Wait for the replica rebuilding to finish.

  11. Kill the engine instance manager pod on node-1.

  12. In a 2-minute retry loop: exec into pod-1, run ls /mnt/html, and verify that the file test_data.txt exists.

  13. Delete pod-1.

  14. In a 2-minute retry loop: exec into pod-1, run ls /mnt/html, and verify that the file test_data.txt exists.
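
For reference, a sketch of the StorageClass used in step 1. The metadata name is illustrative; the provisioner and parameter keys are assumed to be those of the Longhorn CSI driver.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-single-replica # illustrative name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "best-effort"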

Upgrade strategy

There is no upgrade needed.