From 53220cc7dae1f11a5a918bfcad21e4f32f70035c Mon Sep 17 00:00:00 2001
From: Joshua Moody
Date: Fri, 19 Feb 2021 14:41:27 -0800
Subject: [PATCH] Add LEP for live migration feature

Longhorn #2127

Signed-off-by: Joshua Moody
---
 .../20210216-volume-live-migration.md | 235 ++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 enhancements/20210216-volume-live-migration.md

diff --git a/enhancements/20210216-volume-live-migration.md b/enhancements/20210216-volume-live-migration.md
new file mode 100644
index 0000000..b5ddf52
--- /dev/null
+++ b/enhancements/20210216-volume-live-migration.md
@@ -0,0 +1,235 @@
# Volume live migration

## Summary

To enable Harvester to utilize KubeVirt's live migration support, we need to allow for volume live migration,
so that a KubeVirt-triggered migration leads to a volume migration from the old node to the new node.

### Related Issues
- https://github.com/longhorn/longhorn/issues/2127
- https://github.com/rancher/harvester/issues/384
- https://github.com/longhorn/longhorn/issues/87

## Motivation

### Goals
- Support Harvester VM live migration

### Non-goals
- Using multiple engines for faster volume failover in scenarios other than live migration

## Proposal
We want to add volume migration support so that we can use KubeVirt's VM live migration support via Harvester.
By limiting this feature to that specific use case, we can use the CSI driver's attach/detach flow to implement the migration interactions.
To do this, we need to be able to start a second engine for a volume on a different node, using matching replicas of the first engine.
We only support this for a volume while it is used with `volumeMode=Block`, since we don't support concurrent writes, and having Kubernetes mount a filesystem even in read-only
mode can potentially modify the filesystem (metadata, access time, journal replay, etc.).


### User Stories
Previously, the only way to support live migration in Harvester was to use a Longhorn RWX volume, which meant dealing with NFS and its problems.
Instead, we want to add live migration support for a traditional Longhorn volume, similar to what was previously implemented for the old RancherVM.
After this enhancement, Longhorn will support a special `migratable` flag that allows a Longhorn volume to be live migrated from one node to another.
The assumption here is that the initial consumer will never write to the block device again once the new consumer takes over.


### User Experience In Detail

#### Creating a migratable storage class
To test, one needs to create a storage class with `migratable: "true"` set as a parameter.
Afterwards, an RWX PVC is necessary, since migratable volumes need to be attachable to multiple nodes (an example claim is sketched after the storage class below).
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-migratable
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
  migratable: "true"
```
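
A minimal claim against this storage class might look like the following sketch. This is illustrative only and not part of the proposal itself; the claim name and requested size are made up for the example.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: migratable-claim # illustrative name
spec:
  accessModes:
    - ReadWriteMany # migratable volumes must be attachable to multiple nodes
  volumeMode: Block # live migration is only supported for block-mode volumes
  storageClassName: longhorn-migratable
  resources:
    requests:
      storage: 8Gi # arbitrary size for this example
```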

#### Testing KubeVirt VM live migration
We use CirrOS as our test image for the live migration.
The login account is `cirros` and the password is `gocubsgo`.
To test with Harvester, one can use the below example YAMLs as a quick start.

Deploy the below YAML so that Harvester will download the CirrOS image into the local MinIO store.
NOTE: The CirrOS servers don't support the range requests that the KubeVirt importer uses, which is why we let Harvester download the image first.
```yaml
apiVersion: harvester.cattle.io/v1alpha1
kind: VirtualMachineImage
metadata:
  name: image-jxpnq
  namespace: default
spec:
  displayName: cirros-0.4.0-x86_64-disk.img
  url: https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
```

Afterwards, deploy the `cirros-rwx-blk.yaml` to create a live migratable virtual machine.
```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    harvester.cattle.io/creator: harvester
  name: cirros-rwx-blk
spec:
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1alpha1
    kind: DataVolume
    metadata:
      annotations:
        cdi.kubevirt.io/storage.import.requiresScratch: "true"
      name: cirros-rwx-blk
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 8Gi
        storageClassName: longhorn-migratable
        volumeMode: Block
      source:
        http:
          certConfigMap: importer-ca-none
          url: http://minio.harvester-system:9000/vm-images/image-jxpnq # locally downloaded cirros image
  running: true
  template:
    metadata:
      annotations:
        harvester.cattle.io/diskNames: '["cirros-rwx-blk"]'
        harvester.cattle.io/imageId: default/image-jxpnq
      labels:
        harvester.cattle.io/creator: harvester
        harvester.cattle.io/vmName: cirros-rwx-blk
    spec:
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
        devices:
          disks:
          - disk:
              bus: virtio
            name: disk-0
          inputs:
          interfaces:
          - masquerade: {}
            model: virtio
            name: default
        machine:
          type: q35
        resources:
          requests:
            memory: 128M
      hostname: cirros-rwx-blk
      networks:
      - name: default
        pod: {}
      volumes:
      - dataVolume:
          name: cirros-rwx-blk
        name: disk-0
```

Once the `cirros-rwx-blk` virtual machine is up and running, deploy the `cirros-rwx-migration.yaml` to initiate a virtual machine live migration.
```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: cirros-rwx-blk
spec:
  vmiName: cirros-rwx-blk
```

### API changes
- The volume detach call now expects a `detachInput { hostId: "" }`. If `hostId == ""`, the call is treated as a detach from all nodes, which matches the previous behavior.
- The CSI driver now calls volume attach/detach for all volume types: RWO, RWX (NFS), and RWX (migratable).
- The API volume-manager now determines whether an attach/detach is necessary and valid, instead of the CSI driver.

#### Attach changes
1. If a volume is already attached (to the requested node), we return the current volume.
2. If a volume is mode RWO,
   it will be attached to the requested node,
   unless it is already attached to a different node.
3. If a volume is mode RWX (NFS),
   it will only be attached when requested in maintenance mode,
   since in all other cases the volume is controlled by the share-manager.
4. If a volume is mode RWX (migratable),
   it will initially be attached to the requested node; if it is already attached to a different node,
   a migration to the new node will be initiated.


#### Detach changes
1. If a volume is already detached (from all nodes, or from the requested node), we return the current volume.
2. If a volume is mode RWO,
   it will be detached from the requested node.
3. If a volume is mode RWX (NFS),
   it will only be detached if it is currently attached in maintenance mode,
   since in all other cases the volume is controlled by the share-manager.
4. If a volume is mode RWX (migratable),
   it will be detached from the requested node.
   If a migration is in progress, the action taken depends on which node the detach request targets (see the state sketch after this list):
   a migration confirmation will be triggered if the detach request is for the first node,
   and a migration rollback will be triggered if the detach request is for the second node.
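
To make these transitions concrete, the following is a hedged sketch of what the Longhorn `Volume` custom resource could look like while a migration is in progress. The field names follow this proposal (`v.spec.nodeID`, `v.spec.migrationNodeID`); the volume and node names are illustrative, and the exact CRD layout may differ.
```yaml
apiVersion: longhorn.io/v1beta1 # assumed API group/version for the Volume CRD
kind: Volume
metadata:
  name: cirros-rwx-blk # illustrative volume name
  namespace: longhorn-system
spec:
  nodeID: node1          # node the volume is currently attached to
  migrationNodeID: node2 # set when a second node requests attachment; starts the migration
  # other spec fields omitted
```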

## Design

### Implementation Overview

#### Volume migration flow
The live migration intention is triggered and evaluated via the attach/detach calls.
The expectation is that Kubernetes will bring up a new pod that requests attachment of the already attached volume.
This initiates the migration; after this, two things can happen.
Either Kubernetes terminates the new pod, which is equivalent to a migration rollback, or
the old pod is terminated, which is equivalent to a migration complete operation.

1. Users launch a new VM with a new migratable Longhorn volume ->
   A migratable volume is created, then attached to node1.
   Similar to regular attachment, Longhorn sets `v.spec.nodeID` to `node1` here.
2. Users launch the 2nd VM (pod) with the same Longhorn volume ->
   1. Kubernetes requests that the volume (already attached) be attached to node2.
      Longhorn then receives the attach call and sets `v.spec.migrationNodeID` to `node2`, with `v.spec.nodeID = node1`.
   2. The Longhorn volume controller brings up the new engine on node2, with inactive matching replicas (same as for a live engine upgrade).
   3. The Longhorn CSI driver polls for the existence of the second engine on node2 before acknowledging attachment success.
3. Once the migration is started (running engines on both nodes),
   the following detach decides whether the migration is completed successfully
   or a migration rollback is desired:
   1. If succeeded: KubeVirt will remove the original pod on `node1`.
      This leads to a detach request for node1, upon which Longhorn sets
      `v.spec.nodeID` to `node2` and unsets `v.spec.migrationNodeID`.
   2. If failed: KubeVirt will terminate the new pod on `node2`.
      This leads to a detach request for node2, upon which Longhorn keeps
      `v.spec.nodeID` set to `node1` and unsets `v.spec.migrationNodeID`.
4. The Longhorn volume controller then cleans up the second engine and switches the active replicas to be those of the current engine.

In summary:
```
n1 | vm1 has the volume attached (v.spec.nodeID = n1)
n2 | vm2 requests attachment [migrationStart] -> (v.spec.migrationNodeID = n2)
volume-controller brings up new engine on n2, with inactive matching replicas (same as live engine upgrade)
csi driver polls for existence of second engine on n2 before acknowledging attach

The following detach decides whether a migration is completed successfully, or a migration rollback is desired.
n1 | vm1 requests detach of n1 [migrationComplete] -> (v.spec.nodeID = n2, v.spec.migrationNodeID = "")
n2 | vm2 requests detach of n2 [migrationRollback] -> (v.spec.nodeID = n1, v.spec.migrationNodeID = "")
The volume controller then cleans up the second engine and switches the active replicas to be the current engine ones.
```

### Test plan

#### E2E tests
- E2E test for a successful migration (the expected end states are sketched below)
- E2E test for a migration rollback

### Upgrade strategy
Requires using a storage class with the `migratable: "true"` parameter for the Harvester volumes,
as well as an RWX PVC, to allow live migration in Kubernetes/KubeVirt.
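
For reference, the two terminal states that the E2E tests above would verify on the Longhorn volume could look as follows. This is a hedged sketch using the field names from this proposal; only the relevant spec fields are shown, and the node names are illustrative.
```yaml
# Migration completed successfully (detach was requested for the first node):
spec:
  nodeID: node2
  migrationNodeID: ""
---
# Migration rolled back (detach was requested for the second node):
spec:
  nodeID: node1
  migrationNodeID: ""
```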