Effective Disaster Recovery: The Day We Deleted Production

Conference: KubeCon + CloudNativeCon Europe 2022

2022-05-18

Authors: Rick Spencer, Wojciech Kocjan

Summary

The presentation covers the incident response and recovery process for a company's production Kubernetes cluster. The company used Velero to recover the cluster after a Git revert caused data loss. The presentation also covers the company's decision to commit generated files in its CD process and the importance of prioritizing data protection.

Velero was used to recover the Kubernetes cluster after a Git revert caused data loss
Committing generated files in the CD process is safer and makes changes easier to track
Data protection should always be the top priority
The decision to restore with Velero or redeploy depends on the type of data and how often it changes

During the incident, the company had to make a conscious decision about whether to restore with Velero or redeploy the cluster. They ultimately chose Velero because the data in ZooKeeper didn't change often. If the data had been changing constantly, however, a restore would have lost all user data written between the last Velero backup and the time of the incident. The company also committed generated files in its CD process to make changes easier to track, and treated data protection as its top priority.
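
The trade-off described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the speakers' tooling: the function names, the `changes_often` flag, and the `tolerated_loss` threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Span of time whose writes would be lost by restoring the last backup."""
    return incident - last_backup


def choose_recovery(
    last_backup: datetime,
    incident: datetime,
    changes_often: bool,
    tolerated_loss: timedelta,
) -> str:
    """Pick a recovery path (hypothetical policy, for illustration only)."""
    window = data_loss_window(last_backup, incident)
    # Rarely changing data (or a short enough window) loses little on restore.
    if not changes_often or window <= tolerated_loss:
        return "restore-from-backup"
    # Frequently changing data would lose everything written in the window.
    return "redeploy"


last_backup = datetime(2022, 1, 1, 0, 0)
incident = datetime(2022, 1, 1, 6, 0)
static_choice = choose_recovery(last_backup, incident, False, timedelta(hours=1))
busy_choice = choose_recovery(last_backup, incident, True, timedelta(hours=1))
```

With a six-hour gap between backup and incident, rarely changing data still favors the restore, while frequently changing data under a one-hour tolerance does not.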

Abstract

Imagine waking up to an SMS: "we lost a cluster." On that day, with a one-line configuration change, we accidentally removed all of the compute from one of our busiest production clusters, causing a multi-hour outage. This presentation will cover the incident from the days leading up to it through our full recovery, our customers' response to it, and how we implemented changes based on our learnings. It will go into detail about the configuration of our CI/CD pipeline and the specific change that caused the outage. Thankfully, we had a disaster recovery plan in place. We will discuss which parts of our disaster recovery plan worked and, critically, the few parts that didn't. The session will cover a combination of technical and management content.
