Longhorn currently leaves failed backups behind and does not delete them automatically until the backup target is removed. Failed backup cleanup will occur when making a backup to the remote backup target fails. This LEP proposes triggering the deletion of failed backups automatically.
### Related Issues
[[IMPROVEMENT] Support failed/obsolete orphaned backup cleanup](https://github.com/longhorn/longhorn/issues/3898)
- Provide a global option that lets users enable automatic cleanup of failed backups.
- The cleanup process should not get stuck in, or block, the reconciliation of the controllers.
### Non-goals [optional]
- Clean up unknown files or directories on the remote backup target.
## Proposal
1. The `backup_volume_controller` will be responsible for deleting a Backup CR whose state is `Error` or `Unknown`.
   The reconciliation procedure of the `backup_volume_controller` gets the latest failed backups from the datastore and deletes them.
```text
    queue            ┌───────────────┐
    ┌┐ ┌┐ ┌┐         │               │
... ││ ││ ││ ──────► │ syncHandler() │
    └┘ └┘ └┘         │               │
                     └───────┬───────┘
                             │
                  ┌──────────▼───────────┐
                  │                      │
                  │     reconcile()      │
                  │                      │
                  └──────────┬───────────┘
                             │
                  ┌──────────▼───────────┐
                  │                      │
                  │  get failed backups  │
                  │                      │
                  │   then delete them   │
                  │                      │
                  └──────────────────────┘
```
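The flow in the diagram can be sketched roughly as follows. This is only an illustration of the control flow in Python, not the controller's actual code: `InMemoryDatastore`, `list_failed_backups`, `delete_backup`, and the `Completed` state used in the usage note are hypothetical stand-ins, and only the `Error` and `Unknown` states come from this proposal.

```python
from collections import deque
from dataclasses import dataclass, field

# States this proposal treats as "failed".
FAILED_STATES = {"Error", "Unknown"}


@dataclass
class Backup:
    name: str
    volume: str
    state: str


@dataclass
class InMemoryDatastore:
    """Hypothetical stand-in for the datastore; not the real Longhorn API."""
    backups: dict = field(default_factory=dict)

    def list_failed_backups(self, volume):
        # Return a new list so callers can delete while iterating over it.
        return [b for b in self.backups.values()
                if b.volume == volume and b.state in FAILED_STATES]

    def delete_backup(self, name):
        self.backups.pop(name, None)


def reconcile(ds, volume):
    """Get the failed backups of one backup volume, then delete them."""
    deleted = []
    for backup in ds.list_failed_backups(volume):
        ds.delete_backup(backup.name)
        deleted.append(backup.name)
    return deleted


def sync_handler(queue, ds):
    """Drain the work queue, reconciling one backup volume per item."""
    results = {}
    while queue:
        volume = queue.popleft()
        results[volume] = reconcile(ds, volume)
    return results
```

For example, with backups `b1` (`Completed`), `b2` (`Error`), and `b3` (`Unknown`) on one volume, calling `sync_handler(deque(["vol-a"]), ds)` would remove `b2` and `b3` and leave `b1` in place.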
### User Stories
When a user or recurring job tries to make a backup and store it in the remote backup target, many situations can cause the backup procedure to fail. In some cases, failed backups remain in the Longhorn system, and Longhorn does not handle them until the user removes the backup target. Otherwise, users have to manage the failed backups manually via the Longhorn GUI or command line tools.
After the enhancement, Longhorn can delete failed backups automatically once auto-deletion is enabled.
### User Experience In Detail
- Via Longhorn GUI
    - Users can see that a backup failed if auto-deletion is disabled.
    - Users can check the event log to understand why a backup failed and was deleted.
- `backups` CRs in the `Error` or `Unknown` state will be removed by the `backup_volume_controller`, triggered by backupstore polling, when the `backup_monitor` detects that the backup failed.
- `backups` CRs in the `Error` or `Unknown` state will not be handled if auto-deletion is disabled.
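A minimal sketch of how the auto-deletion switch could gate a single polling pass is shown below. The function name `cleanup_on_poll`, its parameters, and the event message wording are assumptions made for illustration; the proposal does not specify them. The sketch also records an event and moves on when one deletion fails, so a single bad backup does not stall the pass.

```python
FAILED_STATES = ("Error", "Unknown")


def cleanup_on_poll(backups, auto_delete_enabled, delete_fn):
    """Run one polling pass over `backups`.

    `backups` maps a backup name to a (state, error message) tuple.
    `delete_fn` is a hypothetical callable that deletes one backup CR.
    Returns the list of event messages produced during the pass.
    """
    events = []
    if not auto_delete_enabled:
        # Failed backups are left untouched so users can still inspect them.
        return events
    for name, (state, message) in backups.items():
        if state not in FAILED_STATES:
            continue
        try:
            delete_fn(name)
            events.append(f"deleted failed backup {name}: {message}")
        except Exception as exc:
            # Record the failure and continue with the remaining backups.
            events.append(f"failed to delete backup {name}: {exc}")
    return events
```

With auto-deletion disabled the function returns no events and calls `delete_fn` for nothing, which matches the behaviour described in the last bullet.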
1. We already have the Backup CR to handle backup resources, and a failed backup is unlike an orphaned replica, which is not owned by any volume from the beginning.
2. Cascading deletion of the orphaned CR and the Backup CR would be more complicated than simply handling failed backups immediately when the backup procedure fails. In both this LEP and the orphan framework, the failed backups would be deleted by the `backup_volume_controller`.
3. Listing orphaned backups and failed backups on both UI pages, `Orphaned Data` and `Backup`, might be confusing for users. Deleting items manually on either page would involve the cascading deletion described in point 2.