Add doc for disaster recovery volume

Longhorn issue 535
2019-05-17 18:15:03 +00:00 · 2019-05-17 18:15:03 +00:00 · e7c9e7c197
commit e7c9e7c197
parent 5a7a3cb755
2 changed files with 54 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -211,6 +211,7 @@ More examples are available at `./examples/`
 ### [Deal with Kubernetes node failure](./docs/node-failure.md)
 ### [Use CSI driver on RancherOS/CoreOS + RKE or K3S](./docs/csi-config.md)
 ### [Restore a backup to an image file](./docs/restore-to-file.md)
+### [Disaster Recovery Volume](./docs/dr-volume.md)

 # Troubleshooting
 You can click `Generate Support Bundle` link at the bottom of the UI to download a zip file contains Longhorn related configuration and logs.
--- a/docs/dr-volume.md
+++ b/docs/dr-volume.md
@ -0,0 +1,53 @@
+# Disaster Recovery Volume
+## What is Disaster Recovery Volume?
+To increase the resiliency of the volume, Longhorn supports disaster recovery volume.
+ 
+The disaster recovery volume is designed for the backup cluster in the case of the whole main cluster goes down. 
+A disaster recovery volume is normally in standby mode. User would need to activate it before using it as a normal volume.
+A disaster recovery volume can be created from a volume's backup in the backup store. And Longhorn will monitor its 
+original backup volume and incrementally restore from the latest backup. Once the original volume in the main cluster goes
+down and users decide to activate the disaster recovery volume in the backup cluster, the disaster recovery volume can be
+activated immediately in the most condition, so it will greatly reduced the time needed to restore the data from the
+backup store to the volume in the backup cluster.
+
+## How to create Disaster Recovery Volume?
+1. In the cluster A, make sure the original volume X has backup created or recurring backup scheduling.
+2. Set backup target in cluster B to be same as cluster A's.
+3. In backup page of cluster B, choose the backup volume X then create disaster recovery volume Y. It's highly recommended
+to use backup volume name as disaster volume name.
+4. Attach the disaster recovery volume Y to any node. Then Longhorn will automatically polling for the last backup of the
+volume X, and incrementally restore it to the volume Y.
+5. If volume X is down, users can activate volume Y immediately. Once activated, volume Y will become a 
+normal Longhorn volume.
+    5.1. Notice that deactivate a normal volume is not allowed.
+
+## About Activating Disaster Recovery Volume
+1. A disaster recovery volume doesn't support creating/deleting/reverting snapshot, creating backup, creating
+PV/PVC. Users cannot update `Backup Target` in Settings if any disaster recovery volumes exist.
+
+2. When users try to activate a disaster recovery volume, Longhorn will check the last backup of the original volume. If
+it hasn't been restored, the restoration will be started, and the activate action will fail. Users need to wait for 
+the restoration to complete before retrying.
+
+3. For disaster recovery volume, `Last Backup` indicates the most recent backup of its original backup volume. If the icon 
+representing disaster volume is gray, it means the volume is restoring `Last Backup` and users cannot activate this 
+volume right now; if the icon is blue, it means the volume has restored the `Last Backup`. 
+
+## RPO and RTO
+Typically incremental restoration is triggered by the periodic backup store update. Users can set backup store update 
+interval in `Setting - General - Backupstore Poll Interval`. Notice that this interval can potentially impact 
+Recovery Time Objective(RTO). If it is too long, there may be a large amount of data for the disaster recovery volume to 
+restore, which will take a long time. As for Recovery Point Objective(RPO), it is determined by recurring backup 
+scheduling of the backup volume. You can check [here](snapshot-backup.md) to see how to set recurring backup in Longhorn.
+
+e.g.:
+
+If recurring backup scheduling for normal volume A is creating backup every hour, then RPO is 1 hour.
+
+Assuming the volume creates backup every hour, and incrementally restoring data of one backup takes 5 minutes.  
+
+If `Backupstore Poll Interval` is 30 minutes, then there will be at most one backup worth of data since last restoration.
+The time for restoring one backup is 5 minute, so RTO is 5 minutes.
+
+If `Backupstore Poll Interval` is 12 hours, then there will be at most 12 backups worth of data since last restoration.
+The time for restoring the backups is 5 * 12 = 60 minutes, so RTO is 60 minutes.