From 631ddeb2ac90ad8db85902b7891ba74eac396caa Mon Sep 17 00:00:00 2001
From: Sheng Yang
Date: Thu, 31 Jan 2019 18:53:14 -0800
Subject: [PATCH] Create node-failure.md

---
 docs/node-failure.md | 49 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 docs/node-failure.md

diff --git a/docs/node-failure.md b/docs/node-failure.md
new file mode 100644
index 0000000..2e6c977
--- /dev/null
+++ b/docs/node-failure.md
@@ -0,0 +1,49 @@
+# Node Failure Handling with Longhorn
+
+## What to expect when a Kubernetes Node fails
+
+When a Kubernetes node with the CSI driver installed fails (the following is based on Kubernetes v1.12 with the default setup):
+1. After **one minute**, `kubectl get nodes` will report `NotReady` for the failed node.
+2. After about **five minutes**, the states of all the pods on the `NotReady` node will change to either `Unknown` or `NodeLost`.
+3. If you're deploying with a StatefulSet or Deployment, you need to decide whether it's safe to force-delete the pods of the workload
+running on the lost node. See [here](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/).
+   1. A StatefulSet has a stable identity, so Kubernetes won't force-delete the Pod for the user.
+   2. A Deployment doesn't have a stable identity, but Longhorn is Read-Write-Once storage, which means a volume can only be attached
+to one Pod at a time. So the new Pod created by Kubernetes won't be able to start, because the Longhorn volume is still attached to the
+old Pod on the lost Node.
+4. If you decide to delete the Pod manually (and forcefully), Kubernetes will take about another **six minutes** to delete the
+VolumeAttachment object associated with the Pod, and thus finally detach the Longhorn volume from the lost Node so it can be used by
+the new Pod. See the example commands below.
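+
+For steps 1 and 2, you can watch the failure propagate with `kubectl`. This is only a sketch of what you might see; the node
+and pod names (`node-2`, `my-app-0`) are hypothetical:
+
+```
+# Example output; node and pod names are hypothetical.
+$ kubectl get nodes
+NAME     STATUS     ROLES    AGE   VERSION
+node-1   Ready      <none>   10d   v1.12.0
+node-2   NotReady   <none>   10d   v1.12.0
+
+$ kubectl get pods -o wide
+NAME       READY   STATUS    RESTARTS   AGE   IP           NODE
+my-app-0   1/1     Unknown   0          2h    10.42.1.15   node-2
+```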
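+
+If you decide force deletion is safe (step 4), the Kubernetes document linked above performs it with a command along these
+lines; the pod name and namespace here are placeholders:
+
+```
+# The pod name and namespace are placeholders.
+kubectl delete pod my-app-0 --grace-period=0 --force --namespace default
+```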
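+
+To see when the Longhorn volume is finally released, you can watch the VolumeAttachment objects (a standard Kubernetes CSI
+resource); the attachment object name and output columns below are illustrative:
+
+```
+# Example output; the attachment object name is illustrative.
+$ kubectl get volumeattachment
+NAME            AGE
+csi-1b2c3d...   2h
+```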