
Cloudy With a Chance Of Chaos: Verifying the Resiliency Of Cloud-Native Applications

2022-10-27

Authors: Bella Wiseman


Summary

Chaos engineering is essential for verifying the resiliency of cloud-native applications, but traditional techniques are often not viable for managed services in the cloud. A chaos test has five parts: defining success criteria, measuring steady state, injecting failure, observing outcomes, and restoring the system.
  • Chaos engineering involves deliberately causing production incidents to determine their impact on the environment.
  • Defining success criteria means deciding up front what outcome counts as a pass for the test.
  • Measuring steady state means observing how the system behaves when everything is going well.
  • Injecting failure is the chaos test itself.
  • Observing outcomes means watching what happens to the system while the failure is in effect.
  • Restoring the system returns it to steady state, if it has not recovered on its own.
  • Managed services in the cloud make traditional chaos engineering techniques difficult to apply, because the underlying machines are not accessible.
The speaker discussed a case study of a chaos experiment conducted on a real Goldman Sachs system. The findings were that monitoring should be decoupled from the service it monitors and that dashboards should be tested regularly with realistic, incident-like scenarios. The speaker also emphasized the importance of monitoring from the customer's perspective using synthetic probes.
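As a rough illustration of the five parts of a chaos test described above, the following Python sketch runs them in order against an in-memory stand-in for a service. Everything here is an assumption made for illustration: `ChaosExperiment`, `ToyService`, the 99% threshold, and the sample count are hypothetical names and values, not the tooling or the system from the talk. The `probe` callable plays the role of a synthetic probe, checking the service the way a customer would rather than relying on the service's own health reporting.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ChaosExperiment:
    """Hypothetical five-part chaos test; all names are illustrative."""
    probe: Callable[[], bool]            # synthetic probe: True if a customer-style request succeeds
    inject_failure: Callable[[], None]   # the chaos action
    restore: Callable[[], None]          # undo the chaos action
    min_success_rate: float = 0.99       # part 1: success criterion (assumed value)
    samples: int = 20                    # probes per measurement (assumed value)

    def _success_rate(self) -> float:
        results = [self.probe() for _ in range(self.samples)]
        return sum(results) / len(results)

    def run(self) -> Dict[str, object]:
        # Part 2: measure steady state before touching anything.
        baseline = self._success_rate()
        if baseline < self.min_success_rate:
            # The system is already unhealthy, so the test would prove nothing.
            return {"status": "aborted", "baseline": baseline, "passed": False}

        # Part 3: inject failure.
        self.inject_failure()
        try:
            # Part 4: observe outcomes from the customer's perspective.
            during = self._success_rate()
        finally:
            # Part 5: restore the system, even if observation raised an error.
            self.restore()

        after = self._success_rate()
        return {
            "status": "completed",
            "baseline": baseline,
            "during": during,
            "after": after,
            "passed": during >= self.min_success_rate,
            "restored": after >= self.min_success_rate,
        }


class ToyService:
    """In-memory stand-in for a replicated service (not a real system)."""

    def __init__(self, replicas: int = 2) -> None:
        self.healthy = [True] * replicas

    def handle_request(self) -> bool:
        # A request succeeds while at least one replica is healthy.
        return any(self.healthy)

    def kill_replica(self) -> None:
        self.healthy[0] = False

    def heal(self) -> None:
        self.healthy = [True] * len(self.healthy)


service = ToyService(replicas=2)
experiment = ChaosExperiment(
    probe=service.handle_request,
    inject_failure=service.kill_replica,
    restore=service.heal,
)
result = experiment.run()
print(result)
```

In this toy setup the experiment passes only when the service keeps answering probes while one replica is down. With a managed service, where the underlying machines cannot be powered off, `inject_failure` would have to be replaced by whatever fault mechanism is actually available, which this sketch deliberately leaves open.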

Abstract

Interest in chaos engineering has exploded over the last few years, with more and more organizations looking to adopt the practice. But as those same organizations shift to using managed services in the cloud, traditional chaos engineering techniques are often no longer viable. Powering down a machine is a simple, powerful, and versatile way to uniformly inject failure across all types of applications. But today, when we build cloud-native apps, we often choose to use managed services that provide a layer of abstraction on top of the underlying machines. How can we inject realistic chaos when we have no access to the underlying machines? Join Bella Wiseman of Goldman Sachs as she discusses chaos engineering essentials, chaos on the cloud, and a real-life case study of a chaos engineering experiment at Goldman Sachs.

Materials: