
How a Couple of Characters (and GitOps) Brought Down Our Site

2022-05-19

Authors: Guy Templeton, Stuart Davidson


Summary

The importance of a blameless incident culture in DevOps and incident management
  • An experienced engineer made a mistake in the root file of the templating system, which was missed by a senior engineer during review
  • Skyscanner promotes a blameless culture around incidents
  • Incident management involves all responsible parties, including legal and user satisfaction teams
  • The incident commander role is crucial in coordinating technical and non-technical teams during an incident
  • Restoring services after an incident can be complicated and time-consuming
  • During the incident, a non-technical incident commander coordinated communication with CXOs and legal teams, allowing technical teams to focus on fixing the issue

Abstract

Skyscanner have been enthusiastic adopters of Cloud-Native technologies and practices, adopting Kubernetes, Helm and ArgoCD as well as a wide range of other open-source technologies. However, adopting these technologies and practices in an existing environment doesn’t come without challenges. In this talk, Stuart and Guy will walk you through the longer-term cultural and technical challenges and benefits brought by adopting a GitOps model. They will also dig deeper into a global outage of Skyscanner’s website and mobile apps, and how these approaches both exacerbated the problem and sped up the time to resolution. They’ll then take the opportunity to explain some of the learnings from the incident, in the hope that the insight they gained from this catastrophic situation will help you and your organisation avoid the same mistakes.

Materials: