From e7c9e7c197747f14a98581db16373b8a71e83eeb Mon Sep 17 00:00:00 2001 From: Shuo Wu Date: Fri, 17 May 2019 18:15:03 +0000 Subject: [PATCH] Add doc for disaster recovery volume Longhorn issue 535 --- README.md | 1 + docs/dr-volume.md | 53 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 54 insertions(+) create mode 100644 docs/dr-volume.md diff --git a/README.md b/README.md index 0ca5802..a745512 100644 --- a/README.md +++ b/README.md @@ -211,6 +211,7 @@ More examples are available at `./examples/` ### [Deal with Kubernetes node failure](./docs/node-failure.md) ### [Use CSI driver on RancherOS/CoreOS + RKE or K3S](./docs/csi-config.md) ### [Restore a backup to an image file](./docs/restore-to-file.md) +### [Disaster Recovery Volume](./docs/dr-volume.md) # Troubleshooting You can click `Generate Support Bundle` link at the bottom of the UI to download a zip file contains Longhorn related configuration and logs. diff --git a/docs/dr-volume.md b/docs/dr-volume.md new file mode 100644 index 0000000..afdf0b5 --- /dev/null +++ b/docs/dr-volume.md @@ -0,0 +1,53 @@ +# Disaster Recovery Volume +## What is Disaster Recovery Volume? +To increase the resiliency of the volume, Longhorn supports disaster recovery volume. + +The disaster recovery volume is designed for the backup cluster in the case of the whole main cluster goes down. +A disaster recovery volume is normally in standby mode. User would need to activate it before using it as a normal volume. +A disaster recovery volume can be created from a volume's backup in the backup store. And Longhorn will monitor its +original backup volume and incrementally restore from the latest backup. Once the original volume in the main cluster goes +down and users decide to activate the disaster recovery volume in the backup cluster, the disaster recovery volume can be +activated immediately in the most condition, so it will greatly reduced the time needed to restore the data from the +backup store to the volume in the backup cluster. + +## How to create Disaster Recovery Volume? +1. In the cluster A, make sure the original volume X has backup created or recurring backup scheduling. +2. Set backup target in cluster B to be same as cluster A's. +3. In backup page of cluster B, choose the backup volume X then create disaster recovery volume Y. It's highly recommended +to use backup volume name as disaster volume name. +4. Attach the disaster recovery volume Y to any node. Then Longhorn will automatically polling for the last backup of the +volume X, and incrementally restore it to the volume Y. +5. If volume X is down, users can activate volume Y immediately. Once activated, volume Y will become a +normal Longhorn volume. + 5.1. Notice that deactivate a normal volume is not allowed. + +## About Activating Disaster Recovery Volume +1. A disaster recovery volume doesn't support creating/deleting/reverting snapshot, creating backup, creating +PV/PVC. Users cannot update `Backup Target` in Settings if any disaster recovery volumes exist. + +2. When users try to activate a disaster recovery volume, Longhorn will check the last backup of the original volume. If +it hasn't been restored, the restoration will be started, and the activate action will fail. Users need to wait for +the restoration to complete before retrying. + +3. For disaster recovery volume, `Last Backup` indicates the most recent backup of its original backup volume. If the icon +representing disaster volume is gray, it means the volume is restoring `Last Backup` and users cannot activate this +volume right now; if the icon is blue, it means the volume has restored the `Last Backup`. + +## RPO and RTO +Typically incremental restoration is triggered by the periodic backup store update. Users can set backup store update +interval in `Setting - General - Backupstore Poll Interval`. Notice that this interval can potentially impact +Recovery Time Objective(RTO). If it is too long, there may be a large amount of data for the disaster recovery volume to +restore, which will take a long time. As for Recovery Point Objective(RPO), it is determined by recurring backup +scheduling of the backup volume. You can check [here](snapshot-backup.md) to see how to set recurring backup in Longhorn. + +e.g.: + +If recurring backup scheduling for normal volume A is creating backup every hour, then RPO is 1 hour. + +Assuming the volume creates backup every hour, and incrementally restoring data of one backup takes 5 minutes. + +If `Backupstore Poll Interval` is 30 minutes, then there will be at most one backup worth of data since last restoration. +The time for restoring one backup is 5 minute, so RTO is 5 minutes. + +If `Backupstore Poll Interval` is 12 hours, then there will be at most 12 backups worth of data since last restoration. +The time for restoring the backups is 5 * 12 = 60 minutes, so RTO is 60 minutes. \ No newline at end of file