Add LEP for live migration feature
Longhorn #2127 Signed-off-by: Joshua Moody <joshua.moody@rancher.com>
This commit is contained in:
parent
2fd9cc9459
commit
53220cc7da
235
enhancements/20210216-volume-live-migration.md
Normal file
235
enhancements/20210216-volume-live-migration.md
Normal file
@ -0,0 +1,235 @@
|
|||||||
|
# Volume live migration
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
To enable Harvester to utilize Kubevirts live migration support, we need to allow for volume live migration,
|
||||||
|
so that a Kubevirt triggered migration will lead to a volume migration from the old node to the new node.
|
||||||
|
|
||||||
|
### Related Issues
|
||||||
|
- https://github.com/longhorn/longhorn/issues/2127
|
||||||
|
- https://github.com/rancher/harvester/issues/384
|
||||||
|
- https://github.com/longhorn/longhorn/issues/87
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
### Goals
|
||||||
|
- Support Harvester VM live migration
|
||||||
|
|
||||||
|
### Non-goals
|
||||||
|
- Using multiple engines for faster volume failover for other scenarios than live migration
|
||||||
|
|
||||||
|
## Proposal
|
||||||
|
We want to add volume migration support so that we can use the VM live migration support of Kubevirt via Harvester.
|
||||||
|
By limiting this feature to that specific use case we can use the csi drivers attach / detach flow to implement migration interactions.
|
||||||
|
To do this, we need to be able to start a second engine for a volume on a different node that uses matching replicas of the first engine.
|
||||||
|
We only support this for a volume while it is used with `volumeMode=BLOCK`, since we don't support concurrent writes and having kubernetes mount a filesystem even in read only
|
||||||
|
mode can potentially lead to a modification of the filesystem (metadata, access time, journal replay, etc).
|
||||||
|
|
||||||
|
|
||||||
|
### User Stories
|
||||||
|
Previously the only way to support live migration in Harvester was using a Longhorn RWX volume that meant dealing with NFS and it's problems,
|
||||||
|
instead we want to add support for live migration for a traditional Longhorn volume this was previously implemented for the old RancherVM.
|
||||||
|
After this enhancement Longhorn will support a special `migratable` flag that allows for a Longhorn volume to be live migrated from one node to another.
|
||||||
|
The assumption here is that the initial consumer will never write again to the block device once the new consumer takes over.
|
||||||
|
|
||||||
|
|
||||||
|
### User Experience In Detail
|
||||||
|
|
||||||
|
#### Creating a migratable storage class
|
||||||
|
To test one needs to create a storage class with `migratable: "true"` set as a parameter.
|
||||||
|
Afterwards an RWX PVC is necessary since migratable volumes need to be able to be attached to multiple nodes.
|
||||||
|
```yaml
|
||||||
|
kind: StorageClass
|
||||||
|
apiVersion: storage.k8s.io/v1
|
||||||
|
metadata:
|
||||||
|
name: longhorn-migratable
|
||||||
|
provisioner: driver.longhorn.io
|
||||||
|
allowVolumeExpansion: true
|
||||||
|
parameters:
|
||||||
|
numberOfReplicas: "3"
|
||||||
|
staleReplicaTimeout: "2880" # 48 hours in minutes
|
||||||
|
fromBackup: ""
|
||||||
|
migratable: "true"
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Testing Kubevirt VM live migration
|
||||||
|
We use CirOS as our test image for the live migration.
|
||||||
|
The login account is `cirros` and the password is `gocubsgo`.
|
||||||
|
To test with Harvester one can use the below example yamls as a quick start.
|
||||||
|
|
||||||
|
Deploy the below yaml so that Harvester will download the CirrOS image into the local Minio store.
|
||||||
|
NOTE: The CirrOS servers don't support the range request which Kubevirt importer uses, which is why we let harvester download the image first.
|
||||||
|
```yaml
|
||||||
|
apiVersion: harvester.cattle.io/v1alpha1
|
||||||
|
kind: VirtualMachineImage
|
||||||
|
metadata:
|
||||||
|
name: image-jxpnq
|
||||||
|
namespace: default
|
||||||
|
spec:
|
||||||
|
displayName: cirros-0.4.0-x86_64-disk.img
|
||||||
|
url: https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards deploy the `cirros-rwx-blk.yaml` to create a live migratabale virtual machine.
|
||||||
|
```yaml
|
||||||
|
apiVersion: kubevirt.io/v1alpha3
|
||||||
|
kind: VirtualMachine
|
||||||
|
metadata:
|
||||||
|
labels:
|
||||||
|
harvester.cattle.io/creator: harvester
|
||||||
|
name: cirros-rwx-blk
|
||||||
|
spec:
|
||||||
|
dataVolumeTemplates:
|
||||||
|
- apiVersion: cdi.kubevirt.io/v1alpha1
|
||||||
|
kind: DataVolume
|
||||||
|
metadata:
|
||||||
|
annotations:
|
||||||
|
cdi.kubevirt.io/storage.import.requiresScratch: "true"
|
||||||
|
name: cirros-rwx-blk
|
||||||
|
spec:
|
||||||
|
pvc:
|
||||||
|
accessModes:
|
||||||
|
- ReadWriteMany
|
||||||
|
resources:
|
||||||
|
requests:
|
||||||
|
storage: 8Gi
|
||||||
|
storageClassName: longhorn-migratable
|
||||||
|
volumeMode: Block
|
||||||
|
source:
|
||||||
|
http:
|
||||||
|
certConfigMap: importer-ca-none
|
||||||
|
url: http://minio.harvester-system:9000/vm-images/image-jxpnq # locally downloaded cirros image
|
||||||
|
running: true
|
||||||
|
template:
|
||||||
|
metadata:
|
||||||
|
annotations:
|
||||||
|
harvester.cattle.io/diskNames: '["cirros-rwx-blk"]'
|
||||||
|
harvester.cattle.io/imageId: default/image-jxpnq
|
||||||
|
labels:
|
||||||
|
harvester.cattle.io/creator: harvester
|
||||||
|
harvester.cattle.io/vmName: cirros-rwx-blk
|
||||||
|
spec:
|
||||||
|
domain:
|
||||||
|
cpu:
|
||||||
|
cores: 1
|
||||||
|
sockets: 1
|
||||||
|
threads: 1
|
||||||
|
devices:
|
||||||
|
disks:
|
||||||
|
- disk:
|
||||||
|
bus: virtio
|
||||||
|
name: disk-0
|
||||||
|
inputs:
|
||||||
|
interfaces:
|
||||||
|
- masquerade: {}
|
||||||
|
model: virtio
|
||||||
|
name: default
|
||||||
|
machine:
|
||||||
|
type: q35
|
||||||
|
resources:
|
||||||
|
requests:
|
||||||
|
memory: 128M
|
||||||
|
hostname: cirros-rwx-blk
|
||||||
|
networks:
|
||||||
|
- name: default
|
||||||
|
pod: {}
|
||||||
|
volumes:
|
||||||
|
- dataVolume:
|
||||||
|
name: cirros-rwx-blk
|
||||||
|
name: disk-0
|
||||||
|
```
|
||||||
|
|
||||||
|
Once the `cirros-rwx` virtual machine is up and running deploy the `cirros-rwx-migration.yaml` to initiate a virtual machine live migration.
|
||||||
|
```yaml
|
||||||
|
apiVersion: kubevirt.io/v1alpha3
|
||||||
|
kind: VirtualMachineInstanceMigration
|
||||||
|
metadata:
|
||||||
|
name: cirros-rwx-blk
|
||||||
|
spec:
|
||||||
|
vmiName: cirros-rwx-blk
|
||||||
|
```
|
||||||
|
|
||||||
|
### API changes
|
||||||
|
- volume detach call now expects a `detachInput { hostId: "" }` if `hostId==""` it will be treated as detach from all nodes same behavior as before.
|
||||||
|
- csi driver now calls volume attach/detach for all volume types: RWO, RWX (NFS), RWX (Migratable).
|
||||||
|
- the api volume-manager now determines, whether attach/detach is necessary and valid instead of the csi driver.
|
||||||
|
|
||||||
|
#### Attach changes
|
||||||
|
1. If a volume is already attached (to the requested node) we will return the current volume.
|
||||||
|
2. If a volume is mode RWO,
|
||||||
|
it will be attached to the requested node,
|
||||||
|
unless it's attached already to a different node.
|
||||||
|
3. If a volume is mode RWX (NFS),
|
||||||
|
it will only be attached when requested in maintenance mode.
|
||||||
|
Since in other cases the volume is controlled by the share-manager.
|
||||||
|
4. If a volume is mode RWX (Migratable),
|
||||||
|
will initially be attached to the requested node unless already attached,
|
||||||
|
at which point a migration will be initiated to the new node.
|
||||||
|
|
||||||
|
|
||||||
|
#### Detach changes
|
||||||
|
1. If a volume is already detached (from all, from the requested node) we will return the current volume.
|
||||||
|
2. If a volume is mode RWO,
|
||||||
|
It will be detached from the requested node.
|
||||||
|
3. If a volume is mode RWX (NFS),
|
||||||
|
it will only be detached if it's currently attached in maintenance mode.
|
||||||
|
Since in other cases the volume is controlled by the share-manager.
|
||||||
|
4. If a volume is mode RWX (Migratable)
|
||||||
|
It will be detached from the requested node.
|
||||||
|
if a migration is in progress then depending on the requested node to detach different migration actions will happen.
|
||||||
|
A migration confirmation will be triggered if the detach request is for the first node.
|
||||||
|
A migration rollback will be triggered if the detach request is for the second node.
|
||||||
|
|
||||||
|
## Design
|
||||||
|
|
||||||
|
### Implementation Overview
|
||||||
|
|
||||||
|
#### Volume migration flow
|
||||||
|
The live migration intention is triggered and evaluated via the attach/detach calls.
|
||||||
|
The expectation is that Kubernetes will bring up a new pod that requests attachment of the already attached volume.
|
||||||
|
This will initiate the migration start, after this there are two things that can happen.
|
||||||
|
Either Kubernetes will terminate the new pod which is equivalent to a migration rollback, or
|
||||||
|
the old pod will be terminated which is equivalent to a migration complete operation.
|
||||||
|
|
||||||
|
1. Users launch a new VM with a new migratable Longhorn volume ->
|
||||||
|
A migratable volume is created then attached to node1.
|
||||||
|
Similar to regular attachment, Longhorn will set `v.spec.nodeID` to `node1` here.
|
||||||
|
2. Users launch the 2nd VM (pod) with the same Longhorn volume ->
|
||||||
|
1. Kubernetes requests that the volume (already attached) be attached to node2.
|
||||||
|
Then Longhorn receives the attach call and set `v.spec.migrationNodeID` to `node2` with `v.spec.nodeID = node1`.
|
||||||
|
2. Longhorn volume-controller brings up the new engine on node2, with inactive matching replicas (same as live engine upgrade)
|
||||||
|
3. Longhorn CSI driver polls for the existence of the second engine on node2 before acknowledging attachment success.
|
||||||
|
3. Once the migration is started (running engines on both nodes),
|
||||||
|
the following detach decides whether migration is completed successfully,
|
||||||
|
or a migration rollback is desired:
|
||||||
|
1. If succeeded: Kubevirt will remove the original pod on `node1`,
|
||||||
|
this will lead to requesting detachment from node1, which will lead to longhorn setting
|
||||||
|
`v.spec.nodeID` to `node2` and unsetting `v.spec.migrationNodeID`
|
||||||
|
2. If failed: Kubevirt will terminate the new pod on `node2`,
|
||||||
|
this will lead to requesting detachment from node2, which will lead to longhorn keeping
|
||||||
|
`v.spec.nodeID` to `node1` and unsetting `v.spec.migrationNodeID`
|
||||||
|
4. Longhorn volume controller then cleans up the second engine and switches the active replicas to be the current engine ones.
|
||||||
|
|
||||||
|
In summary:
|
||||||
|
```
|
||||||
|
n1 | vm1 has the volume attached (v.spec.nodeID = n1)
|
||||||
|
n2 | vm2 requests attachment [migrationStart] -> (v.spec.migrationNodeID = n2)
|
||||||
|
volume-controller brings up new engine on n2, with inactive matching replicas (same as live engine upgrade)
|
||||||
|
csi driver polls for existence of second engine on n2 before acknowledging attach
|
||||||
|
|
||||||
|
The following detach decides whether a migration is completed successfully, or a migration rollback is desired.
|
||||||
|
n1 | vm1 requests detach of n1 [migrationComplete] -> (v.spec.nodeID = n2, v.spec.migrationNodeID = "")
|
||||||
|
n2 | vm2 requests detach of n2 [migrationRollback] -> (v.spec.NodeID = n1, v.spec.migrationNodeID = "")
|
||||||
|
The volume controller then cleans up the second engine and switches the active replicas to be the current engine ones.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test plan
|
||||||
|
|
||||||
|
#### E2E tests
|
||||||
|
- E2E test for migration successful
|
||||||
|
- E2E test for migration rollback
|
||||||
|
|
||||||
|
### Upgrade strategy
|
||||||
|
Requires using a storage class with `migratable: "true"` parameter for the harvester volumes
|
||||||
|
as well as an RWX PVC to allow live migration in Kubernetes/Kubevirt.
|
||||||
|
|
Loading…
Reference in New Issue
Block a user