Data Locality - Option To Keep A Local Replica On The Same Node As The Engine
Summary
A Longhorn volume can be backed by replicas on some nodes in the cluster and accessed by a pod running on any node in the cluster.
In the current implementation of Longhorn, the pod that uses a Longhorn volume may be on a node that doesn't contain any replica of the volume.
In some cases, it is desirable to have a local replica on the same node as the consuming pod.
In this document, we refer to the property of having a local replica as data locality.
This enhancement gives users the option to have a local replica on the same node as the engine, which means on the same node as the consuming pod.
Related Issues
https://github.com/longhorn/longhorn/issues/1045
Motivation
Goals
Provide users an option to try to migrate a replica to the same node as the consuming pod.
Non-goals
Another approach to achieve data locality is to influence the Kubernetes scheduling decision so that pods get scheduled onto the nodes which contain the volume's replicas. However, this is not a goal of this LEP. See https://github.com/longhorn/longhorn/issues/1045 for more discussion about this approach.
Proposal
We give users two options for the data locality setting: `disabled` and `best-effort`.
In `disabled` mode, there may or may not be a local replica of the volume on the same node as the consuming pod; Longhorn doesn't do anything about it.
In `best-effort` mode, if a volume is attached to a node that has no replica, the volume controller will start rebuilding a replica on that node after the volume is attached.
Once the rebuilding process is done, it will remove one of the other replicas to keep the replica count as specified.
User Stories
Sometimes, having data locality is critical.
For example, when the network is bad or the node is temporarily disconnected, having a local replica keeps the consuming pod running.
Another case is when the application workload can do replication itself (e.g. a database) and wants a volume with a single replica for each pod.
Without the data locality feature, replicas of multiple volumes may end up on the same node, which defeats the replication intention of the workload. See more in Story 2.
In the current implementation of Longhorn, users cannot ensure that a pod will have a local replica.
After this enhancement is implemented, users can choose between `disabled` (the default setting) and `best-effort`.
Story 1
A user has three hyper-converged nodes and default settings with `default-replica-count: 2`.
He wants to ensure a pod always runs with at least one local replica, which would reduce the amount of network traffic needed to keep the data in sync.
There does not appear to be an obvious way for him to schedule the pod using affinities.
Story 2
A user runs a database application that can do replication itself.
The database app creates multiple pods and each pod uses a Longhorn volume with `replica-count = 1`.
The database application knows how to schedule pods into different nodes so that they achieve HA.
The problem is that replicas of multiple volumes could land on the same node which destroys the HA capability.
With the data locality feature, we can ensure that each replica is on the same node as its consuming pod, and therefore the replicas end up on different nodes.
User Experience In Detail
- Users create a new volume using the Longhorn UI with `dataLocality` set to `best-effort`.
- If users attach the volume to a node which doesn't contain any replica, they will see that Longhorn migrates a local replica to that node.
- Users create a StorageClass with `dataLocality: best-effort` set.
- Users launch a StatefulSet with the StorageClass.
- Users will find that there is always a replica on the node where the pod resides.
- Users update `dataLocality` to `disabled`, detach the volume, and attach it to a node which doesn't have any replica.
- Users will see that Longhorn does not create a local replica on the new node.
API changes
There are 2 API changes:
- When creating a new volume, the body of the request sent to `/v1/volumes` has a new field `dataLocality` set to either `disabled` or `best-effort`.
- Implement a new API for users to update the `dataLocality` setting for an individual volume. The new API could be `/v1/volumes/<VOLUME_NAME>?action=updateDataLocality`. This API expects the request's body to have the form `{dataLocality:<DATA_LOCALITY_MODE>}` (illustrated below).
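
As an illustration, the proposed action could be invoked with a plain HTTP POST. This is a minimal sketch only: the `longhorn-frontend` address, the helper name `updateDataLocality`, and the error handling are assumptions for the example, not part of the proposal.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// updateDataLocality posts the proposed updateDataLocality action for a volume.
// The service address below is an assumption for illustration; use the
// longhorn-manager/UI endpoint of your installation instead.
func updateDataLocality(volumeName, mode string) error {
	url := fmt.Sprintf("http://longhorn-frontend/v1/volumes/%s?action=updateDataLocality", volumeName)
	body := bytes.NewBufferString(fmt.Sprintf(`{"dataLocality": %q}`, mode))

	resp, err := http.Post(url, "application/json", body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("updateDataLocality failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := updateDataLocality("testvol2", "best-effort"); err != nil {
		fmt.Println(err)
	}
}
```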
Design
Implementation Overview
There are 2 modes for `dataLocality`:

- `disabled` is the default mode. There may or may not be a local replica of the volume on the same node as the consuming pod; Longhorn doesn't do anything about it.
- `best-effort` mode instructs Longhorn to try to keep a local replica on the same node as the consuming pod. If Longhorn cannot keep the local replica (due to not having enough disk space, incompatible disk tags, etc.), Longhorn does not stop the volume.
There are 3 settings the user can change for data locality:

- A global default setting in the Longhorn UI settings. The global setting should only function as a default value, like the replica count. It doesn't change any existing volume's setting (see the resolution sketch after this list).
- Specify the `dataLocality` mode for an individual volume upon creation using the UI.
- Specify the `dataLocality` mode as a parameter on the StorageClass.
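
To make the interaction between these settings concrete, the sketch below shows the intended resolution order at volume creation time, assuming an empty per-volume value means "not specified". The function name and signature are illustrative, not Longhorn's actual code, and the resolution never changes an already-created volume's setting.

```go
package datalocality

// resolveDataLocality is an illustrative helper, not Longhorn's real code.
// The value set on the volume itself (via the UI, the create-volume API body,
// or the StorageClass parameter) wins; otherwise the global DefaultDataLocality
// setting is used, and the final fallback is the default mode, "disabled".
func resolveDataLocality(volumeValue, defaultSetting string) string {
	if volumeValue != "" {
		return volumeValue
	}
	if defaultSetting != "" {
		return defaultSetting
	}
	return "disabled"
}
```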
Implementation steps:
- Add a global setting `DefaultDataLocality`.
- Add the new field `DataLocality` to `VolumeSpec`.
- Modify the volume creation API so that it extracts, verifies, and sets the `dataLocality` mode for the new volume. If the volume creation request doesn't have a `dataLocality` field inside its body, we use the `DefaultDataLocality` setting for the new volume.
- Modify the `CreateVolume` function inside the CSI package so that it extracts, verifies, and sets the `dataLocality` mode for the new volume. This makes sure that Kubernetes can use CSI to create a Longhorn volume with a specified `dataLocality` mode.
- Inside the volume controller's sync logic, we add a new function `ReconcileLocalReplica` (a sketch of its flow follows this list).
- When a volume enters the volume controller's sync logic, `ReconcileLocalReplica` checks the `dataLocality` mode of the volume. If the `dataLocality` is `disabled`, it does nothing and returns.
- If the `dataLocality` is `best-effort`, `ReconcileLocalReplica` checks whether there is a local replica on the same node as the volume.
  - If there is no local replica, we create an in-memory replica struct. We don't create a replica in the datastore using `createReplica()` directly because we may need to delete the new replica if it fails `ScheduleReplicaToNode`; this prevents the UI from repeatedly showing the new replica being created and deleted. Then we try to schedule the replica struct onto the consuming pod's node. If the scheduling fails, we don't do anything; the replica struct will be collected by Go's garbage collector. If the scheduling succeeds, we save the replica struct to the data store, which triggers replica rebuilding on the consuming pod's node.
  - If there already exists a local replica on the consuming pod's node, we check whether there are more healthy replicas than specified in the volume's spec. If so, we remove a replica on one of the other nodes. We prefer to delete replicas on the same disk, then replicas on the same node, then replicas on the same zone.
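
The following is a minimal sketch of the `ReconcileLocalReplica` flow described above, under simplified assumptions. The types and the `hooks` struct (`scheduleToNode`, `persistReplica`, `deleteReplica`) are hypothetical stand-ins for the volume controller's real scheduler and datastore calls, not Longhorn's actual APIs.

```go
package controller

// Simplified stand-in types for this sketch; the real definitions live in
// longhorn-manager and have many more fields.
type Volume struct {
	NodeID           string // node the volume is currently attached to
	DataLocality     string // "disabled" or "best-effort"
	NumberOfReplicas int
}

type Replica struct {
	NodeID  string
	Healthy bool
}

// hooks groups the scheduler/datastore operations the flow needs. They are
// hypothetical placeholders, not Longhorn's real APIs.
type hooks struct {
	scheduleToNode func(nodeID string) (*Replica, error) // may fail: disk space, disk tags, ...
	persistReplica func(*Replica) error                  // saving triggers rebuilding on that node
	deleteReplica  func(*Replica) error
}

// reconcileLocalReplica sketches the best-effort flow described in the list above.
func reconcileLocalReplica(v *Volume, replicas []*Replica, h hooks) error {
	if v.DataLocality != "best-effort" {
		// disabled (or unset, treated as disabled): nothing to do.
		return nil
	}

	var local *Replica
	healthy := 0
	for _, r := range replicas {
		if r.Healthy {
			healthy++
		}
		if r.NodeID == v.NodeID {
			local = r
		}
	}

	if local == nil {
		// Schedule an in-memory replica first and only persist it on success,
		// so the UI doesn't flap between creating and deleting a replica.
		r, err := h.scheduleToNode(v.NodeID)
		if err != nil {
			// Scheduling failed: drop the struct (garbage collected) and keep
			// the volume running; best-effort never blocks the volume.
			return nil
		}
		return h.persistReplica(r)
	}

	// A local replica already exists: if there are now more healthy replicas
	// than the spec asks for, delete one elsewhere. The real implementation
	// prefers replicas on the same disk, then the same node, then the same
	// zone; here we simply pick the first other healthy replica.
	if healthy > v.NumberOfReplicas {
		for _, r := range replicas {
			if r != local && r.Healthy {
				return h.deleteReplica(r)
			}
		}
	}
	return nil
}
```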
UI modification:
- On volume creation, add an input field for `dataLocality`.
- On the volume detail page:
  - On the right volume info panel, add a field to display `selectedVolume.dataLocality`.
  - On the right volume panel, in the Health row, add an icon for the data locality status. Specifically, if `dataLocality=best-effort` but there is no local replica, display a warning icon, similar to the existing replica node redundancy warning.
  - In the volume's actions dropdown, add a new action to update `dataLocality`.
- In the Rancher UI, add a parameter `dataLocality` when creating a storage class using the Longhorn provisioner.
Test plan
Manual Test Plan
- Create a cluster of 9 worker nodes and install Longhorn. Having more nodes helps us to be more confident because the chance of randomly scheduling a replica onto the same node as the engine is small.
Test volume creation with `dataLocality` set to `best-effort`:
- Create volume `testvol` with `Number of Replicas = 2` and `dataLocality` set to `best-effort`.
- Attach `testvol` to a node that doesn't contain any replica.
- Verify that Longhorn schedules a local replica to the same node as the consuming pod. After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 2.
Test volume creation with `dataLocality` set to `disabled`:
- Create another volume, `testvol2`, with `Number of Replicas = 2` and `dataLocality` set to `disabled`.
- Attach `testvol2` to a node that doesn't contain any replica.
- Verify that Longhorn doesn't move any replica.
Test volume creation with `dataLocality` unspecified and the `DefaultDataLocality` setting as `disabled`:
- Leave the `DefaultDataLocality` setting as `disabled` in the Longhorn UI.
- Create another volume, `testvol3`, with `Number of Replicas = 2` and `dataLocality` left empty.
- Attach `testvol3` to a node that doesn't contain any replica.
- Verify that the `dataLocality` of `testvol3` is `disabled` and that Longhorn doesn't move any replica.
Test volume creation with `dataLocality` unspecified and the `DefaultDataLocality` setting as `best-effort`:
- Set the `DefaultDataLocality` setting to `best-effort` in the Longhorn UI.
- Create another volume, `testvol4`, with `Number of Replicas = 2` and `dataLocality` left empty.
- Attach `testvol4` to a node that doesn't contain any replica.
- Verify that the `dataLocality` of `testvol4` is `best-effort`.
- Verify that Longhorn schedules a local replica to the same node as the consuming pod. After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 2.
Test `updateDataLocality` from `disabled` to `best-effort`:
- Change `dataLocality` to `best-effort` for `testvol2`.
- Verify that Longhorn schedules a local replica to the same node as the consuming pod. After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 2.
Test `updateDataLocality` from `best-effort` to `disabled`:
- Change `dataLocality` to `disabled` for `testvol2`.
- Go to the Longhorn UI and increase the `number of replicas` to 3. Wait until the new replica finishes rebuilding.
- Delete the local replica on the same node as the consuming pod.
- Verify that Longhorn doesn't move any replica.
Test volume creation using a storage class with the `dataLocality` parameter set to `disabled`:
- Create the `disabled-longhorn` storage class from this yaml file:

  ```yaml
  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: disabled-longhorn
  provisioner: driver.longhorn.io
  allowVolumeExpansion: true
  parameters:
    numberOfReplicas: "1"
    dataLocality: "disabled"
    staleReplicaTimeout: "2880" # 48 hours in minutes
    fromBackup: ""
  ```

- Create a deployment of 1 pod using a PVC dynamically created by the `disabled-longhorn` storage class.
- The consuming pod is likely scheduled onto a different node than the replica. If this happens, verify that Longhorn doesn't move any replica.
Test volume creation using a storage class with the `dataLocality` parameter set to `best-effort`:
- Create the `best-effort-longhorn` storage class from this yaml file:

  ```yaml
  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: best-effort-longhorn
  provisioner: driver.longhorn.io
  allowVolumeExpansion: true
  parameters:
    numberOfReplicas: "1"
    dataLocality: "best-effort"
    staleReplicaTimeout: "2880" # 48 hours in minutes
    fromBackup: ""
  ```

- Create a shell deployment of 1 pod using the PVC dynamically created by the `best-effort-longhorn` storage class.
- The consuming pod is likely scheduled onto a different node than the replica.
- If this happens, verify that Longhorn schedules a local replica to the same node as the consuming pod. After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 1.
- Verify that the volume CRD has `dataLocality` set to `best-effort`.
Test volume creation using a storage class with the `dataLocality` parameter unspecified:
- Create the `unspecified-longhorn` storage class from this yaml file:

  ```yaml
  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: unspecified-longhorn
  provisioner: driver.longhorn.io
  allowVolumeExpansion: true
  parameters:
    numberOfReplicas: "1"
    staleReplicaTimeout: "2880" # 48 hours in minutes
    fromBackup: ""
  ```

- Create a shell deployment of 1 pod using a PVC dynamically created by the `unspecified-longhorn` storage class.
- The consuming pod is likely scheduled onto a different node than the replica.
- If this happens, depending on the `DefaultDataLocality` setting in the Longhorn UI, verify that Longhorn does or doesn't migrate a local replica to the same node as the consuming pod.
Tests for volumes created in old versions:
- Volumes created in old Longhorn versions don't have the field `dataLocality`.
- We treat those volumes the same as having `dataLocality` set to `disabled`.
- Verify that Longhorn doesn't migrate replicas for those volumes.
Upgrade strategy
No special upgrade strategy is required.
We are adding the new field, `dataLocality`, to the volume CRD's spec.
We then use this field to check whether we need to migrate a replica to the same node as the consuming pod.
When users upgrade Longhorn to this new version, it is possible that some volumes don't have this field.
This is not a problem because we only migrate a replica when `dataLocality` is `best-effort`, so an empty `dataLocality` field is fine.