Disaster recovery

Given access to the IndeVets/core GitHub repository and a recent database snapshot, the entire CATS application and IndeVets data warehouse can be restored from scratch in less than an hour.

The complete, currently deployed state of the IndeVets Kubernetes cluster is tracked in the deploys/k8s-manifests branch. It can be deployed to a fresh Kubernetes cluster hosted by Linode, DigitalOcean, GCP, or Azure. The Create Cluster article walks through the process for Linode clusters, and can be adapted to any other provider as long as an equivalent set of node pools is created and labeled.

The Create Cluster article covers restoring data from a database snapshot or from another database instance. Database snapshots are taken hourly and stored in a cloud storage bucket as an encrypted restic repository. Credentials for connecting to the cloud storage bucket and decrypting the snapshots are shared in the Bitwarden vault, and should be backed up to an additional secure credential archive by key business users. If an online primary or replica database is still available, it can be used as a more up-to-date source.
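Because a restore is only as current as the newest snapshot, it can help to confirm that the hourly schedule is actually producing snapshots. The following is a minimal sketch, not the team's actual tooling: it assumes the JSON comes from `restic snapshots --json` (each entry carries a `time` field), and the function names and the two-hour tolerance are illustrative choices.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp into a timezone-aware datetime."""
    value = value.replace("Z", "+00:00")
    # restic can emit up to nanosecond precision; datetime keeps microseconds,
    # so trim or pad the fractional part to exactly six digits.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    return datetime.fromisoformat(value)


def newest_snapshot_age(snapshots_json: str, now: datetime) -> timedelta:
    """Return how old the most recent snapshot is relative to `now`."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        raise ValueError("repository contains no snapshots")
    newest = max(parse_restic_time(s["time"]) for s in snapshots)
    return now - newest


def is_fresh(snapshots_json: str, now: datetime,
             max_age: timedelta = timedelta(hours=2)) -> bool:
    """True if the newest snapshot is within `max_age` (assumed tolerance)."""
    return newest_snapshot_age(snapshots_json, now) <= max_age
```

In practice the JSON string would be captured from the restic command, with `now` set to `datetime.now(timezone.utc)`. The tolerance is set above one hour so that a single slow or slightly delayed run does not raise a false alarm.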

Testing process

On a quarterly basis:

  • The engineering team should verify that they can restore IndeVets to production from source and backups within 30 minutes.
  • The business should contract a new 3rd-party technical consultant to verify that, given access to GitHub and a database snapshot, they can restore the system to production operation from the documentation within 2 hours.
    • The 3rd-party technical consultant may further verify that they can restore the software development lifecycle and successfully deploy a change to production within a further 8 hours of work.
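The time limits above can be recorded alongside each test so that results are compared against the same targets every quarter. This is a minimal sketch under stated assumptions: the drill names, the `DrillResult` type, and the `summarize` helper are hypothetical, and only the three durations come from the process described here.

```python
from dataclasses import dataclass
from datetime import timedelta

# Targets from the testing process; the dictionary keys are hypothetical names.
TARGETS = {
    "engineering_restore": timedelta(minutes=30),
    "consultant_restore": timedelta(hours=2),
    # Measured from the end of the consultant's restore, not from the start.
    "consultant_sdlc": timedelta(hours=8),
}


@dataclass
class DrillResult:
    drill: str
    elapsed: timedelta

    @property
    def passed(self) -> bool:
        """A drill passes when it finishes within its target."""
        return self.elapsed <= TARGETS[self.drill]


def summarize(results):
    """Return one human-readable line per drill result."""
    return [
        f"{r.drill}: {'PASS' if r.passed else 'FAIL'} "
        f"({r.elapsed} of {TARGETS[r.drill]} allowed)"
        for r in results
    ]
```

For example, `summarize([DrillResult("engineering_restore", timedelta(minutes=15))])` reports a pass, since 15 minutes is inside the 30-minute target.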

Incidents and tests

2021-06-21 Incident

2021-08-08 Test

  • Result: new cluster provisioned and data restored within 15 minutes
  • Corrective actions:
    • Documentation streamlined
    • Node affinities and tolerations migrated from using provider-assigned pool ID labels to using IndeVets-assigned semantic labels to provide simpler portability between cloud providers and clusters
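The move to semantic labels can be illustrated with a small sketch that builds the scheduling fields of a pod spec. The label key `example.com/node-pool` and the pool name are hypothetical, since the actual IndeVets label names are not given here. The sketch also assumes that nodes are tainted with the same key and value they are labeled with.

```python
import json

# Hypothetical label key; substitute the real IndeVets-assigned label.
POOL_LABEL_KEY = "example.com/node-pool"


def scheduling_for_pool(pool: str) -> dict:
    """Return pod-spec fields that pin a workload to a semantically labeled pool."""
    return {
        "affinity": {
            "nodeAffinity": {
                # Hard requirement: only schedule onto nodes carrying the label.
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": POOL_LABEL_KEY,
                                    "operator": "In",
                                    "values": [pool],
                                }
                            ]
                        }
                    ]
                }
            }
        },
        # Assumes the pool's nodes are tainted with the same key and value.
        "tolerations": [
            {
                "key": POOL_LABEL_KEY,
                "operator": "Equal",
                "value": pool,
                "effect": "NoSchedule",
            }
        ],
    }


def to_json(pool: str) -> str:
    """Serialize the stanza for merging into a manifest."""
    return json.dumps(scheduling_for_pool(pool), indent=2)
```

Because the label value names a role rather than a provider-generated pool ID, the same manifest can be applied unchanged to a cluster on another provider once its node pools carry the equivalent labels.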