From 43318cd24f1e6b759d3970763726448f44b5e5bf Mon Sep 17 00:00:00 2001
From: Joshua Moody
Date: Thu, 2 Jul 2020 14:12:16 -0700
Subject: [PATCH] enhancements: Add LEP for backupstore file locks

The idea is to implement a locking mechanism that utilizes the backupstore
to prevent the following dangerous cases of concurrent operations.
1. prevent backup deletion during backup restoration
2. prevent backup deletion while a backup is in progress
3. prevent backup creation during backup deletion
4. prevent backup restoration during backup deletion

The locking solution shouldn't unnecessarily block operations, so the following cases should be allowed.
1. allow backup creation during restoration
2. allow backup restoration during creation

The locking solution should have a maximum wait time for lock acquisition,
after which the backup operation fails so that the user does not have to wait forever.

The locking solution should be self-expiring, so that when a process dies unexpectedly,
future processes are able to acquire the lock.

The locking solution should guarantee that only a single type of lock is active at a time.

The locking solution should allow a lock to be passed down into asynchronously running goroutines.

Longhorn #612

Signed-off-by: Joshua Moody
---
 .../20200701-backupstore-file-locks.md        | 141 ++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 enhancements/20200701-backupstore-file-locks.md

diff --git a/enhancements/20200701-backupstore-file-locks.md b/enhancements/20200701-backupstore-file-locks.md
new file mode 100644
index 0000000..22598b0
--- /dev/null
+++ b/enhancements/20200701-backupstore-file-locks.md
@@ -0,0 +1,141 @@
+# Backupstore File Locks
+
+## Summary
+
+This enhancement will address backup issues that are the result of concurrently running backup operations,
+by implementing a synchronisation solution that utilizes files on the backupstore as locks.
+
+### Related Issues
+
+https://github.com/longhorn/longhorn/issues/612
+https://github.com/longhorn/longhorn/issues/1393
+https://github.com/longhorn/longhorn/issues/1392
+https://github.com/longhorn/backupstore/pull/37
+
+## Motivation
+
+### Goals
+
+Identify and prevent backup issues caused by concurrent backup operations.
+Since it should be safe to perform backup creation & backup restoration at the same time,
+these concurrent operations should remain allowed.
+
+## Proposal
+
+The idea is to implement a locking mechanism that utilizes the backupstore
+to prevent the following dangerous cases of concurrent operations.
+1. prevent backup deletion during backup restoration
+2. prevent backup deletion while a backup is in progress
+3. prevent backup creation during backup deletion
+4. prevent backup restoration during backup deletion
+
+The locking solution shouldn't unnecessarily block operations, so the following cases should be allowed.
+1. allow backup creation during restoration
+2. allow backup restoration during creation
+
+The locking solution should have a maximum wait time for lock acquisition,
+after which the backup operation fails so that the user does not have to wait forever.
+
+The locking solution should be self-expiring, so that when a process dies unexpectedly,
+future processes are able to acquire the lock.
+
+The locking solution should guarantee that only a single type of lock is active at a time.
+
+The locking solution should allow a lock to be passed down into asynchronously running goroutines.
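+
+For illustration, here is a minimal, self-contained sketch of the intended reference-counting behaviour when a lock is handed to an asynchronously running goroutine. The `fileLock` type below is a toy in-memory stand-in (the real lock would additionally create, refresh, and remove a lock file on the backupstore), and the `Lock()`/`Unlock()` names simply follow the usage outlined later in the Design section.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"sync/atomic"
+)
+
+// fileLock is a toy stand-in that only models reference counting; it does not
+// talk to a backupstore at all.
+type fileLock struct {
+	count int32
+}
+
+// Lock takes a reference. In the proposed design this call would block until
+// the lock file wins the priority check against the other lock files.
+func (l *fileLock) Lock() error {
+	atomic.AddInt32(&l.count, 1)
+	return nil
+}
+
+// Unlock drops a reference. Once the count reaches zero, the proposed design
+// would remove the lock file from the backupstore.
+func (l *fileLock) Unlock() {
+	if atomic.AddInt32(&l.count, -1) == 0 {
+		fmt.Println("last reference dropped, lock would be removed")
+	}
+}
+
+func main() {
+	lock := &fileLock{}
+	if err := lock.Lock(); err != nil { // acquired by the calling process
+		panic(err)
+	}
+
+	var wg sync.WaitGroup
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		_ = lock.Lock() // the goroutine holds its own reference
+		defer lock.Unlock()
+		// ... perform the backup or restore work ...
+	}()
+
+	wg.Wait()
+	lock.Unlock() // the caller's reference; the last Unlock removes the lock
+}
+```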
+
+### User Experience In Detail
+
+Before this enhancement, it is possible to delete a backup while a backup restoration is in progress.
+This would lead to an unhealthy restoration volume.
+
+After this enhancement, a backup deletion can only happen after the restoration has been completed.
+This way the backupstore continues to contain all the blocks that are required for the restoration.
+
+After this enhancement, creation & restoration operations are mutually exclusive with backup deletion operations.
+
+### API changes
+
+## Design
+### Implementation Overview
+
+Conceptually the lock can be thought of as a **RW** lock;
+it includes a `Type` specifier, and different types are mutually exclusive.
+
+To allow the lock to be passed into asynchronously running goroutines, we add a `count` field
+that keeps track of the current references to this lock.
+
+```go
+type FileLock struct {
+	Name         string
+	Type         LockType
+	Acquired     bool
+	driver       BackupStoreDriver
+	volume       string
+	count        int32
+	serverTime   time.Time
+	refreshTimer *time.Ticker
+}
+```
+
+To make the lock self-expiring, we rely on `serverTime`, which needs to be refreshed by a timer.
+We chose a `LOCK_REFRESH_INTERVAL` of **60** seconds; each refresh cycle, a lock's `serverTime` will be updated.
+A lock is considered expired once the current time is after a lock's `serverTime` + `LOCK_MAX_WAIT_TIME` of **150** seconds.
+Once a lock is expired, any currently active attempts to acquire that lock will time out.
+
+```go
+const (
+	LOCKS_DIRECTORY       = "locks"
+	LOCK_PREFIX           = "lock"
+	LOCK_SUFFIX           = ".lck"
+	LOCK_REFRESH_INTERVAL = time.Second * 60
+	LOCK_MAX_WAIT_TIME    = time.Second * 150
+	LOCK_CHECK_INTERVAL   = time.Second * 10
+	LOCK_CHECK_WAIT_TIME  = time.Second * 2
+)
+```
+
+Lock Usage
+1. create a new lock instance via `lock := lock.New()`
+2. call `lock.Lock()`, which will block until the lock has been acquired and increment the lock reference count.
+3. defer `lock.Unlock()`, which will decrement the lock reference count and remove the lock once unreferenced.
+
+To make sure the locks are **mutually exclusive**, we use the following process to acquire a lock.
+1. create a lock file on the backupstore with a unique `Name`.
+2. retrieve all lock files from the backupstore and order them by `Acquired`, then by `serverTime`,
+   followed by `Name`.
+3. check if we can acquire the lock; we can only acquire it if there is no unexpired (i) lock
+   of a different type (ii) that has priority (iii).
+   1. Locks are self-expiring: once the current time is after
+      `lock.serverTime + LOCK_MAX_WAIT_TIME`, we no longer need to consider
+      this lock as valid.
+   2. Backup & Restore locks are mapped to compatible types, while Delete
+      locks are mapped to a different type so that they are mutually exclusive with the
+      others.
+   3. Priority is based on the comparison order, where locks are compared by
+      `lock.Acquired`, then by `lock.serverTime`, followed by `lock.Name`;
+      acquired locks are always sorted before non-acquired locks.
+4. if lock acquisition times out, return an error, which will fail the backup operation.
+5. once the lock is acquired, continuously refresh it (which updates `lock.serverTime`).
+6. once the lock is acquired, it can be passed around by calling `lock.Lock()`.
+7. once the lock is no longer referenced, it will be removed from the backupstore.
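+
+To make the priority check in step 3 more concrete, below is a minimal, self-contained sketch of how such a check could look. It works on a simplified in-memory list of lock files; the `LockType` constants, the `canAcquire` helper, and the trimmed-down `FileLock` fields are illustrative only and not the actual backupstore API.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"time"
+)
+
+// LockType distinguishes mutually exclusive lock classes: backup and restore
+// operations share one type, deletions use another.
+type LockType int
+
+const (
+	BackupRestoreLock LockType = iota
+	DeletionLock
+)
+
+// LOCK_MAX_WAIT_TIME mirrors the expiration constant from this proposal.
+const LOCK_MAX_WAIT_TIME = time.Second * 150
+
+// FileLock is a trimmed-down stand-in keeping only the fields needed for the check.
+type FileLock struct {
+	Name       string
+	Type       LockType
+	Acquired   bool
+	serverTime time.Time
+}
+
+// isExpired reports whether a lock's serverTime is too old to still count as valid.
+func isExpired(l *FileLock, now time.Time) bool {
+	return now.After(l.serverTime.Add(LOCK_MAX_WAIT_TIME))
+}
+
+// canAcquire sorts all lock files (acquired locks first, then by serverTime,
+// then by Name) and verifies that no unexpired lock of a different type is
+// ordered before ours.
+func canAcquire(ours *FileLock, all []*FileLock, now time.Time) bool {
+	sort.Slice(all, func(i, j int) bool {
+		if all[i].Acquired != all[j].Acquired {
+			return all[i].Acquired // acquired locks sort first
+		}
+		if !all[i].serverTime.Equal(all[j].serverTime) {
+			return all[i].serverTime.Before(all[j].serverTime)
+		}
+		return all[i].Name < all[j].Name
+	})
+	for _, l := range all {
+		if l.Name == ours.Name {
+			return true // every lock ahead of ours was expired or compatible
+		}
+		if isExpired(l, now) || l.Type == ours.Type {
+			continue
+		}
+		return false // an unexpired lock of a different type has priority
+	}
+	return false
+}
+
+func main() {
+	now := time.Now()
+	ours := &FileLock{Name: "lock-b", Type: DeletionLock, serverTime: now}
+	locks := []*FileLock{
+		ours,
+		// An expired backup/restore lock no longer blocks the deletion lock.
+		{Name: "lock-a", Type: BackupRestoreLock, serverTime: now.Add(-10 * time.Minute)},
+	}
+	fmt.Println("can acquire:", canAcquire(ours, locks, now))
+}
+```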
+
+It's very unlikely to run into lock collisions, since we use a uniquely generated name for each lock file.
+In cases where two locks have the same `lock.serverTime`, we can rely on `lock.Name` to differentiate between the two locks.
+
+### Test plan
+
+A number of integration tests will need to be added for the `longhorn-engine` to test the changes in this proposal:
+1. place an expired lock file into a backupstore, then verify that a new lock can be acquired.
+2. place an active lock file of Type `Delete` into a backupstore,
+   then verify that backup/restore operations will trigger a lock acquisition timeout.
+3. place an active lock file of Type `Delete` into a backupstore,
+   then verify that a new `Delete` operation can acquire a lock.
+4. place an active lock file of Type `Backup/Restore` into a backupstore,
+   then verify that delete operations will trigger a lock acquisition timeout.
+5. place an active lock file of Type `Backup/Restore` into a backupstore,
+   then verify that a new `Backup/Restore` operation can acquire a lock.
+
+### Upgrade strategy
+
+No special upgrade strategy is necessary.