
Cloud Native Chaos Engineering with LitmusChaos

2022-05-19

Authors: Saiyam Pathak, Uma Mukkara, Udit Gaurav


Summary

Cloud native chaos engineering is becoming more democratized and more important for maintaining resilience in complex, dynamic deployment environments.
  • Chaos engineering involves injecting failures into an environment to test the resilience of services and prevent sub-optimal behavior.
  • Cloud native chaos engineering is built on open source and community collaboration.
  • Observability is important for customizing chaos engineering to an organization's needs.
  • Cloud native chaos engineering is becoming more democratized and involves a larger set of personas, including DevOps engineers and cloud native developers.
  • Cloud native chaos engineering is important for maintaining resilience in complex, dynamic deployment environments with many moving parts.
Chaos engineering is like a game day for production environments: failures are injected and the results observed so that sub-optimal behavior can be found and prevented. In the past it was typically done manually by SREs; it now involves a larger set of personas, including DevOps engineers and cloud native developers.

Abstract

The discipline of chaos engineering has evolved since it was introduced by Netflix a decade ago, mostly as a result of the cloud-native paradigm and the proliferation of Kubernetes as the universal control plane for today's distributed architecture. While the essence and basic principles of chaos remain the same, the way it is operationalized has undergone a paradigm shift. That shift includes, but is not limited to, the faults themselves, the environments where they are executed, the personas carrying out the experiments, and the methods used to run them. LitmusChaos is a framework designed to address these newer requirements and enable users to proactively identify weaknesses and improve resilience in their cloud-native setup. This session provides a deep dive into the project, its goals, and how it achieves them.

Materials: