From 53220cc7dae1f11a5a918bfcad21e4f32f70035c Mon Sep 17 00:00:00 2001
From: Joshua Moody
Date: Fri, 19 Feb 2021 14:41:27 -0800
Subject: [PATCH] Add LEP for live migration feature

Longhorn #2127

Signed-off-by: Joshua Moody
---
 .../20210216-volume-live-migration.md | 235 ++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 enhancements/20210216-volume-live-migration.md

diff --git a/enhancements/20210216-volume-live-migration.md b/enhancements/20210216-volume-live-migration.md
new file mode 100644
index 0000000..b5ddf52
--- /dev/null
+++ b/enhancements/20210216-volume-live-migration.md
@@ -0,0 +1,235 @@
# Volume live migration

## Summary

To enable Harvester to utilize KubeVirt's live migration support, we need to allow for volume live migration,
so that a KubeVirt-triggered migration leads to a volume migration from the old node to the new node.

### Related Issues
- https://github.com/longhorn/longhorn/issues/2127
- https://github.com/rancher/harvester/issues/384
- https://github.com/longhorn/longhorn/issues/87

## Motivation

### Goals
- Support Harvester VM live migration

### Non-goals
- Using multiple engines for faster volume failover in scenarios other than live migration

## Proposal
We want to add volume migration support so that we can use KubeVirt's VM live migration support via Harvester.
By limiting this feature to that specific use case, we can use the CSI driver's attach/detach flow to implement the migration interactions.
To do this, we need to be able to start a second engine for a volume on a different node, using matching replicas of the first engine.
We only support this for a volume while it is used with `volumeMode=Block`, since we don't support concurrent writes, and having Kubernetes mount a filesystem even in read-only
mode can potentially modify the filesystem (metadata, access time, journal replay, etc.).


### User Stories
Previously, the only way to support live migration in Harvester was to use a Longhorn RWX volume, which meant dealing with NFS and its problems.
Instead, we want to add live migration support for a traditional Longhorn volume, similar to what was previously implemented for the old RancherVM.
After this enhancement, Longhorn will support a special `migratable` flag that allows a Longhorn volume to be live migrated from one node to another.
The assumption here is that the initial consumer will never write to the block device again once the new consumer takes over.


### User Experience In Detail

#### Creating a migratable storage class
To test, one needs to create a storage class with `migratable: "true"` set as a parameter.
Afterwards, an RWX PVC is necessary, since migratable volumes need to be attachable to multiple nodes (an example claim is sketched after the storage class below).
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-migratable
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
  migratable: "true"
```
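
A minimal claim against this storage class might look like the following sketch. This is illustrative only and not part of the proposal itself; the claim name and requested size are made up for the example.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: migratable-claim # illustrative name
spec:
  accessModes:
    - ReadWriteMany # migratable volumes must be attachable to multiple nodes
  volumeMode: Block # live migration is only supported for block-mode volumes
  storageClassName: longhorn-migratable
  resources:
    requests:
      storage: 8Gi # arbitrary size for this example
```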

#### Testing KubeVirt VM live migration
We use CirrOS as our test image for the live migration.
The login account is `cirros` and the password is `gocubsgo`.
To test with Harvester, one can use the below example YAMLs as a quick start.

Deploy the below YAML so that Harvester will download the CirrOS image into the local MinIO store.
NOTE: The CirrOS servers don't support the range requests that the KubeVirt importer uses, which is why we let Harvester download the image first.
```yaml
apiVersion: harvester.cattle.io/v1alpha1
kind: VirtualMachineImage
metadata:
  name: image-jxpnq
  namespace: default
spec:
  displayName: cirros-0.4.0-x86_64-disk.img
  url: https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
```

Afterwards, deploy the `cirros-rwx-blk.yaml` to create a live migratable virtual machine.
```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    harvester.cattle.io/creator: harvester
  name: cirros-rwx-blk
spec:
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1alpha1
    kind: DataVolume
    metadata:
      annotations:
        cdi.kubevirt.io/storage.import.requiresScratch: "true"
      name: cirros-rwx-blk
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 8Gi
        storageClassName: longhorn-migratable
        volumeMode: Block
      source:
        http:
          certConfigMap: importer-ca-none
          url: http://minio.harvester-system:9000/vm-images/image-jxpnq # locally downloaded cirros image
  running: true
  template:
    metadata:
      annotations:
        harvester.cattle.io/diskNames: '["cirros-rwx-blk"]'
        harvester.cattle.io/imageId: default/image-jxpnq
      labels:
        harvester.cattle.io/creator: harvester
        harvester.cattle.io/vmName: cirros-rwx-blk
    spec:
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
        devices:
          disks:
          - disk:
              bus: virtio
            name: disk-0
          inputs:
          interfaces:
          - masquerade: {}
            model: virtio
            name: default
        machine:
          type: q35
        resources:
          requests:
            memory: 128M
      hostname: cirros-rwx-blk
      networks:
      - name: default
        pod: {}
      volumes:
      - dataVolume:
          name: cirros-rwx-blk
        name: disk-0
```

Once the `cirros-rwx-blk` virtual machine is up and running, deploy the `cirros-rwx-migration.yaml` to initiate a virtual machine live migration.
```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: cirros-rwx-blk
spec:
  vmiName: cirros-rwx-blk
```

### API changes
- The volume detach call now expects a `detachInput { hostId: "" }`. If `hostId == ""`, the call is treated as a detach from all nodes, which matches the previous behavior.
- The CSI driver now calls volume attach/detach for all volume types: RWO, RWX (NFS), and RWX (migratable).
- The API volume-manager now determines whether an attach/detach is necessary and valid, instead of the CSI driver.

#### Attach changes
1. If a volume is already attached (to the requested node), we return the current volume.
2. If a volume is mode RWO,
   it will be attached to the requested node,
   unless it is already attached to a different node.
3. If a volume is mode RWX (NFS),
   it will only be attached when requested in maintenance mode,
   since in all other cases the volume is controlled by the share-manager.
4. If a volume is mode RWX (migratable),
   it will initially be attached to the requested node; if it is already attached to a different node,
   a migration to the new node will be initiated.


#### Detach changes
1. If a volume is already detached (from all nodes, or from the requested node), we return the current volume.
2. If a volume is mode RWO,
   it will be detached from the requested node.
3. If a volume is mode RWX (NFS),
   it will only be detached if it is currently attached in maintenance mode,
   since in all other cases the volume is controlled by the share-manager.
4. If a volume is mode RWX (migratable),
   it will be detached from the requested node.
   If a migration is in progress, the action taken depends on which node the detach request targets (see the state sketch after this list):
   a migration confirmation will be triggered if the detach request is for the first node,
   and a migration rollback will be triggered if the detach request is for the second node.
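
To make these transitions concrete, the following is a hedged sketch of what the Longhorn `Volume` custom resource could look like while a migration is in progress. The field names follow this proposal (`v.spec.nodeID`, `v.spec.migrationNodeID`); the volume and node names are illustrative, and the exact CRD layout may differ.
```yaml
apiVersion: longhorn.io/v1beta1 # assumed API group/version for the Volume CRD
kind: Volume
metadata:
  name: cirros-rwx-blk # illustrative volume name
  namespace: longhorn-system
spec:
  nodeID: node1          # node the volume is currently attached to
  migrationNodeID: node2 # set when a second node requests attachment; starts the migration
  # other spec fields omitted
```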

## Design

### Implementation Overview

#### Volume migration flow
The live migration intention is triggered and evaluated via the attach/detach calls.
The expectation is that Kubernetes will bring up a new pod that requests attachment of the already attached volume.
This initiates the migration; after this, two things can happen.
Either Kubernetes terminates the new pod, which is equivalent to a migration rollback, or
the old pod is terminated, which is equivalent to a migration complete operation.

1. Users launch a new VM with a new migratable Longhorn volume ->
   A migratable volume is created, then attached to node1.
   Similar to regular attachment, Longhorn sets `v.spec.nodeID` to `node1` here.
2. Users launch the 2nd VM (pod) with the same Longhorn volume ->
   1. Kubernetes requests that the volume (already attached) be attached to node2.
      Longhorn then receives the attach call and sets `v.spec.migrationNodeID` to `node2`, with `v.spec.nodeID = node1`.
   2. The Longhorn volume controller brings up the new engine on node2, with inactive matching replicas (same as for a live engine upgrade).
   3. The Longhorn CSI driver polls for the existence of the second engine on node2 before acknowledging attachment success.
3. Once the migration is started (running engines on both nodes),
   the following detach decides whether the migration is completed successfully
   or a migration rollback is desired:
   1. If succeeded: KubeVirt will remove the original pod on `node1`.
      This leads to a detach request for node1, upon which Longhorn sets
      `v.spec.nodeID` to `node2` and unsets `v.spec.migrationNodeID`.
   2. If failed: KubeVirt will terminate the new pod on `node2`.
      This leads to a detach request for node2, upon which Longhorn keeps
      `v.spec.nodeID` set to `node1` and unsets `v.spec.migrationNodeID`.
4. The Longhorn volume controller then cleans up the second engine and switches the active replicas to be those of the current engine.

In summary:
```
n1 | vm1 has the volume attached (v.spec.nodeID = n1)
n2 | vm2 requests attachment [migrationStart] -> (v.spec.migrationNodeID = n2)
volume-controller brings up new engine on n2, with inactive matching replicas (same as live engine upgrade)
csi driver polls for existence of second engine on n2 before acknowledging attach

The following detach decides whether a migration is completed successfully, or a migration rollback is desired.
n1 | vm1 requests detach of n1 [migrationComplete] -> (v.spec.nodeID = n2, v.spec.migrationNodeID = "")
n2 | vm2 requests detach of n2 [migrationRollback] -> (v.spec.nodeID = n1, v.spec.migrationNodeID = "")
The volume controller then cleans up the second engine and switches the active replicas to be the current engine ones.
```

### Test plan

#### E2E tests
- E2E test for a successful migration (the expected end states are sketched below)
- E2E test for a migration rollback

### Upgrade strategy
Requires using a storage class with the `migratable: "true"` parameter for the Harvester volumes,
as well as an RWX PVC, to allow live migration in Kubernetes/KubeVirt.
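
For reference, the two terminal states that the E2E tests above would verify on the Longhorn volume could look as follows. This is a hedged sketch using the field names from this proposal; only the relevant spec fields are shown, and the node names are illustrative.
```yaml
# Migration completed successfully (detach was requested for the first node):
spec:
  nodeID: node2
  migrationNodeID: ""
---
# Migration rolled back (detach was requested for the second node):
spec:
  nodeID: node1
  migrationNodeID: ""
```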