Disaster Recovery: Bringing Back Production from Scratch in Under 1 Hour Using KOps, ArgoCD and Velero - Andre Jay Marcelo

2023-04-20

Authors: Andre Marcelo-Tanner


Summary

Lessons learned from a Kubernetes outage and disaster recovery process:
  • Complete your migrations
  • Be experts in your tooling
  • Always be practicing your disaster recovery
The speaker describes how their company experienced a Kubernetes outage caused by an expired object storage policy. To restore services, the team had to manually fix outdated guides and processes. The incident took two hours to resolve and involved multiple teams working together. The speaker emphasizes the importance of being prepared for disaster recovery and of continually improving the process.

Abstract

This is a real-life story of how our company had an operational incident caused by a misconfiguration, where the most reliable way to get everything working again was to rebuild our entire cluster from scratch. This was only possible because of the investments we had made in GitOps, ArgoCD, kOps and keeping our infrastructure as code. Out of nowhere, our cluster was failing in a way we had never seen before. Our standard backup and recovery methods were not working, etcd was inaccessible, and customers were down. Our last card was to recreate the entire cluster and re-install all our services. We had practiced it many times, but this would be the first real disaster recovery. Would all our planning and migration to GitOps and ArgoCD pay off? Could things be brought back to a healthy state? How fast could we do it? In the end, we managed to recreate the cluster in 51 minutes, and we learned a lot along the way. Many of the tools we invested in did not work as expected, disaster recovery guides were outdated, and things we had never planned for occurred. We talk about the workarounds we had to employ, the work we had to do afterwards, what we learned, and how we plan to improve on this in the future.

Materials: