Add LEP for rwx feature

Longhorn #1470

Signed-off-by: Joshua Moody <joshua.moody@rancher.com>
# RWX volume support
## Summary
We want to natively support RWX volumes. The creation and usage of these RWX volumes should preferably be transparent to the user,
so that no manual user interaction is necessary and an RWX volume looks the same as a regular Longhorn volume.
This would also allow any Longhorn volume to be used as RWX, without special requirements.
### Related Issues
https://github.com/longhorn/longhorn/issues/1470
https://github.com/Longhorn/Longhorn/issues/1183
## Motivation
### Goals
- support creation of RWX volumes via Longhorn
- support creation of RWX volumes via RWX PVCs
- support mounting NFS shares via the CSI driver
- creation of a share-manager that manages and exports volumes via NFS
### Non-goals
- clustered NFS (highly available)
- distributed filesystem
## Proposal
RWX volumes should support all operations that a regular RWO volume supports (backup & restore, DR volume support, etc.).
### User Stories
#### Native RWX support
Before this enhancement we created an RWX provisioner example that used a regular Longhorn volume to export multiple NFS shares.
The provisioner then created native Kubernetes NFS persistent volumes. There were many limitations with this approach (multiple workload PVs on one Longhorn volume, backup & restore is iffy, etc.).
After this enhancement, any time a user uses an RWX PVC we will provision a Longhorn volume and expose it via a share-manager.
The CSI driver will then mount this volume, which is exported via an NFS server from the share-manager pod.
### User Experience In Detail
Users can automatically provision and use an RWX volume by having their workload use an RWX PVC.
Users can see the status of their RWX volumes in the Longhorn UI, same as for RWO volumes.
Users can use the created RWX volume on different nodes at the same time.
### API changes
- add an `AccessMode` field to the `Volume.Spec`
- add `ShareEndpoint` and `ShareState` fields to the `Volume.Status`
- add a new ShareManager CRD, details below
```go
type ShareManagerState string

const (
	ShareManagerStateUnknown  = ShareManagerState("unknown")
	ShareManagerStateStarting = ShareManagerState("starting")
	ShareManagerStateRunning  = ShareManagerState("running")
	ShareManagerStateStopped  = ShareManagerState("stopped")
	ShareManagerStateError    = ShareManagerState("error")
)

type ShareManagerSpec struct {
	Image string `json:"image"`
}

type ShareManagerStatus struct {
	OwnerID  string            `json:"ownerID"`
	State    ShareManagerState `json:"state"`
	Endpoint string            `json:"endpoint"`
}
```
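For reference, a minimal sketch of how the corresponding `Volume` additions could look. Only the field names come from the list above; the `AccessMode` type, its `rwo`/`rwx` values, and the JSON tags are illustrative assumptions.
```go
// Sketch of the Volume spec/status additions named above.
// AccessMode values and JSON tags are illustrative assumptions.
type AccessMode string

const (
	AccessModeReadWriteOnce = AccessMode("rwo")
	AccessModeReadWriteMany = AccessMode("rwx")
)

type VolumeSpec struct {
	// ...existing fields...
	AccessMode AccessMode `json:"accessMode"`
}

type VolumeStatus struct {
	// ...existing fields...
	ShareEndpoint string            `json:"shareEndpoint"`
	ShareState    ShareManagerState `json:"shareState"`
}
```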
### Implementation Overview
#### Key Components
- The volume controller is responsible for creating the share-manager resource and synchronizing the volume's share status and endpoint with the share-manager resource.
- The share-manager controller is responsible for managing share-manager pods and ensuring volume attachment to the share-manager pod.
- The share-manager pod is responsible for health checking and managing the NFS server export of the volume.
- The CSI driver is responsible for mounting the NFS export into the workload pod.
![Architecture Diagram](https://longhorn.io/img/diagrams/rwx/rwx-native-architecture.png)
#### From volume creation to usage
When a new RWX volume with name `test-volume` is created, the volume controller will create a matching share-manager resource with name `test-volume`.
The share-manager-controller will pick up this new share-manager resource and create a share-manager pod with name `share-manager-test-volume` in the longhorn-system namespace,
as well as a service named `test-volume` that always points to the share-manager pod `share-manager-test-volume`.
The controller will set `State=Starting` while the share-manager pod is Pending and not `Ready` yet.
The share-manager pod runs our share-manager image, which allows exporting a block device via ganesha (NFS server).
After starting, the share-manager waits for the attachment of the volume `test-volume`;
this is done by checking for the availability of the block device `test-volume` under the bind-mounted `/host/dev/longhorn/` folder.
The actual volume attachment is handled by the share-manager-controller setting `volume.spec.nodeID` to the `node` of the share-manager pod.
Once the volume is attached, the share-manager will mount the volume, create the export config and start ganesha (NFS server).
Afterwards the share-manager will do periodic health checks against the attached volume; on failure of a health check the pod will terminate.
The share-manager pod becomes `ready` as soon as ganesha is up and running; this is determined via a check against `/var/run/ganesha.pid`.
The share-manager-controller can now update the share-manager to `State=Running` and `Endpoint=nfs://service-cluster-ip/test-volume`.
The volume-controller will update the volume's `ShareState` and `ShareEndpoint` based on the values of the share-manager `State` and `Endpoint`.
Once the volume's `ShareState` is `Running`, the csi-driver can successfully attach & mount the volume into the workload pods.
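To make the startup sequence concrete, below is a minimal sketch of the device wait and readiness check described above. Timeouts, poll intervals and error handling are simplified assumptions; the real share-manager has its own helpers.
```go
// Hedged sketch of the share-manager startup steps described above.
// The paths follow the text; timeouts and intervals are assumptions.
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	devicePath = "/host/dev/longhorn/test-volume" // block device under the bind-mounted host /dev/longhorn
	pidFile    = "/var/run/ganesha.pid"           // written once ganesha is up
)

// waitForFile polls until the given path exists or the timeout expires.
func waitForFile(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for %s", path)
}

func main() {
	// 1. Wait for the Longhorn block device to appear (attachment is triggered by the controller).
	if err := waitForFile(devicePath, 5*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// 2. Mount the device, write the ganesha export config and start ganesha (omitted here).

	// 3. Readiness: report ready once the ganesha pid file exists.
	if err := waitForFile(pidFile, time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("share-manager ready, volume exported")
}
```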
#### Recovering from share-manager node/volume/pod failures
On node failure Kubernetes will mark the share-manager-pod `share-manager-test-volume` as terminating.
The share-manager-controller will mark the share-manager `State=Error` if the share-manager pod is in any state other than `Running` or `Pending`, unless
the share-manager is no longer required for this volume (no workloads consuming this volume).
When the share-manager `State` is `Error`, the volume-controller will continuously set the volume's `RemountRequestedAt=Now` so that the workload pods are cleaned up
until the share-manager is back in order. This cleanup forces the workload pods to initiate a new connection against the NFS server.
In the future we hope to reuse the NFS connection, which will make this step unnecessary.
The share-manager-controller will start a new share-manager pod on a different node and set `State=Starting`.
Once the pod is `Ready`, the controller will set `State=Running` and the workload pods are able to reconnect/remount again.
See above for the complete flow from `Starting -> Ready -> Running -> volume share available`.
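As an illustration of the remount handling above, a trimmed-down sketch; the `RemountRequestedAt` field and share-manager `State` come from the text, while the surrounding types are simplified stand-ins for the Longhorn CRDs.
```go
// Hedged sketch of the volume controller's reaction to a failed share-manager.
// Types are simplified stand-ins; only RemountRequestedAt/State come from the text.
package main

import (
	"fmt"
	"time"
)

type ShareManagerState string

const ShareManagerStateError = ShareManagerState("error")

type VolumeStatus struct {
	ShareState         ShareManagerState
	ShareEndpoint      string
	RemountRequestedAt string
}

// syncShareStatus copies the share-manager state/endpoint onto the volume and,
// while the share-manager is in Error, keeps bumping RemountRequestedAt so the
// workload pods get cleaned up and reconnect once the NFS server is back.
func syncShareStatus(vol *VolumeStatus, state ShareManagerState, endpoint string) {
	vol.ShareState = state
	vol.ShareEndpoint = endpoint
	if state == ShareManagerStateError {
		vol.RemountRequestedAt = time.Now().UTC().Format(time.RFC3339)
	}
}

func main() {
	vol := &VolumeStatus{}
	syncShareStatus(vol, ShareManagerStateError, "")
	fmt.Println("remount requested at:", vol.RemountRequestedAt)
}
```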
### Test plan
- [Manual tests for RWX feature](https://github.com/longhorn/longhorn-tests/pull/496)
- [E2E tests for RWX feature](https://github.com/longhorn/longhorn-tests/pull/512)
### Upgrade strategy
- Users need to add the new share-manager image to air-gapped environments
- Created a [migration job](https://longhorn.io/docs/1.1.0/advanced-resources/rwx-workloads/#migration-from-previous-external-provisioner) to allow users to migrate data from the previous NFS provisioner or other RWO volumes
## Information & References
### NFS ganesha details
- [overview grace period and recovery](https://www.programmersought.com/article/19554210230/)
- [code grace period and recovery](https://www.programmersought.com/article/53201520803/)
- [code & architecture overview](https://www.programmersought.com/article/12291336147/)
- [kernel server vs ganesha](https://events.static.linuxfound.org/sites/events/files/slides/Collab14_nfsGanesha.pdf)
### NFS ganesha DBUS
- [DBUS-Interface](https://github.com/nfs-ganesha/nfs-ganesha/wiki/Dbusinterface#orgganeshanfsdadmin)
- [DBUS-Exports](https://github.com/nfs-ganesha/nfs-ganesha/wiki/DBusExports)
The DBUS interface can be used to add & remove exports, as well as make the server go into the grace period.
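For illustration, a rough sketch of driving these calls from Go with `github.com/godbus/dbus/v5`. The ganesha bus name, object paths, interface names and argument formats below are written from memory and should be checked against the wiki pages above.
```go
// Hedged sketch: adding an export and entering the grace period via ganesha's DBUS API.
// Destination, object paths and method arguments are assumptions to verify against the
// DBUS-Interface and DBusExports wiki pages linked above.
package main

import (
	"log"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		log.Fatalf("failed to connect to the system bus: %v", err)
	}

	// Add an export block defined in an export config file.
	exportMgr := conn.Object("org.ganesha.nfsd", "/org/ganesha/nfsd/ExportMgr")
	call := exportMgr.Call("org.ganesha.nfsd.exportmgr.AddExport", 0,
		"/etc/ganesha/export.d/test-volume.conf", "EXPORT(Export_Id=1)")
	if call.Err != nil {
		log.Fatalf("AddExport failed: %v", call.Err)
	}

	// Ask the server to enter its grace period. The grace method takes a single string
	// argument; its exact format depends on the ganesha version, see the wiki pages above.
	graceArg := "10.0.0.5"
	admin := conn.Object("org.ganesha.nfsd", "/org/ganesha/nfsd/admin")
	if call := admin.Call("org.ganesha.nfsd.admin.grace", 0, graceArg); call.Err != nil {
		log.Fatalf("grace failed: %v", call.Err)
	}
}
```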
### NFS grace period
- [NFS_lock_recovery_notes](https://linux-nfs.org/wiki/index.php/NFS_lock_recovery_notes)
- [lower grace period](https://www.suse.com/support/kb/doc/?id=000019374)
### Ceph ganesha
- [rook crd for ganesha](https://github.com/rook/rook/blob/master/design/ceph/ceph-nfs-ganesha.md)
- [ganesha driver class library](https://docs.openstack.org/manila/rocky/contributor/ganesha.html)
### NFS ganesha Active Passive Setups
- [pacemaker & drbd, suse](https://documentation.suse.com/sle-ha/12-SP4/#redirectmsg)
- [pacemaker & corosync, redhat](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/ch-nfsserver-haaa)
### NFS ganesha recovery backends
Rados requires Ceph; some more differences between the kv/ng backends:
- https://bugzilla.redhat.com/show_bug.cgi?id=1557465
- https://lists.nfs-ganesha.org/archives/list/devel@lists.nfs-ganesha.org/thread/DPULRQKCGB2QQUCUMOVDOBHCPJL22QMX/
**rados_kv**: This is the original rados recovery backend that uses a key-value store. It has support for "takeover" operations: merging one
recovery database into another such that another cluster node could take
over addresses that were hosted on another node. Note that this recovery
backend may not survive crashes that occur _during_ a grace period. If
it crashes and then crashes again during the grace period, the server is
likely to fail to allow any clients to recover (the db will be trashed).
**rados_ng**: a more resilient rados_kv backend. This one does not support
takeover operations, but it should properly survive crashes that occur
during the grace period.
**rados_cluster**: the new (experimental) clustered recovery backend. This
one also does not support address takeover, but should be resilient
enough to handle crashes that occur during the grace period.
FWIW, the semantics for fs vs. fs_ng are similar. fs doesn't survive
crashes that occur during the grace period either.
Unless you're trying to use the dbus "grace" command to initiate address
takeover in an active/active cluster, you probably want rados_ng for
now.
### CSI lifecycle
```
CreateVolume +------------+ DeleteVolume
+------------->| CREATED +--------------+
| +---+----^---+ |
| Controller | | Controller v
+++ Publish | | Unpublish +++
|X| Volume | | Volume | |
+-+ +---v----+---+ +-+
| NODE_READY |
+---+----^---+
Node | | Node
Stage | | Unstage
Volume | | Volume
+---v----+---+
| VOL_READY |
+---+----^---+
Node | | Node
Publish | | Unpublish
Volume | | Volume
+---v----+---+
| PUBLISHED |
+------------+
```
Figure 6: The lifecycle of a dynamically provisioned volume, from
creation to destruction, when the Node Plugin advertises the
STAGE_UNSTAGE_VOLUME capability.
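Relating this lifecycle to the RWX flow above: for an RWX volume the node-side CSI calls end up mounting the NFS share endpoint instead of a local block device. Below is a minimal sketch of such a mount using `k8s.io/mount-utils`; the endpoint, target path and mount options are illustrative assumptions, and the real driver uses the kubelet staging/publish paths and its own option handling.
```go
// Hedged sketch: mounting an RWX volume's NFS share, roughly what the CSI driver's
// node stage/publish step does per the flow above. Endpoint, target path and mount
// options are illustrative assumptions.
package main

import (
	"log"

	mount "k8s.io/mount-utils"
)

func main() {
	// Derived from the volume's ShareEndpoint, e.g. nfs://<service-cluster-ip>/test-volume.
	server := "10.43.12.34"      // cluster IP of the test-volume service
	export := "/test-volume"     // export path served by the share-manager
	target := "/mnt/test-volume" // the real driver mounts to the kubelet staging/publish path

	mounter := mount.New("")
	options := []string{"vers=4.1", "soft", "timeo=300", "retrans=2"}
	if err := mounter.Mount(server+":"+export, target, "nfs", options); err != nil {
		log.Fatalf("failed to mount NFS share: %v", err)
	}
	log.Printf("mounted %s:%s at %s", server, export, target)
}
```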
### Previous Engine migration feature
https://github.com/longhorn/longhorn-manager/commit/2636a5dc6d79aa12116e7e5685ccd831747639df
https://github.com/longhorn/longhorn-tests/commit/3a55d6bfe633165fb6eb9553235b7d0a2e651cec