Chaos engineering is essential for verifying the resiliency of cloud-native applications, but traditional techniques are often not viable for managed cloud services. Running a chaos test has five parts: defining success criteria, measuring steady state, injecting failure, observing outcomes, and restoring the system.
- Chaos engineering involves deliberately causing incidents in production, under controlled conditions, to determine their impact on the environment.
- Defining success criteria establishes, before the test runs, what a passing result looks like.
- Measuring steady state involves observing how the system behaves when everything is going well.
- Injecting failure is the actual chaos test itself.
- Observing outcomes involves seeing what happens during the chaos test.
- Restoring the system is needed only when it does not recover on its own.
- Managed services in the cloud make traditional chaos engineering techniques difficult.
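
The five parts listed above can be sketched as a small harness. This is a minimal illustration, not the speaker's tooling: `run_chaos_test`, `ChaosResult`, `FakeService`, and the `max_degradation` threshold are all hypothetical names chosen for the example, and the "service" is an in-memory stand-in rather than a real managed cloud service.

```python
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosResult:
    baseline: float   # steady-state measurement before the failure
    during: float     # measurement while the failure is active
    passed: bool      # whether the success criterion held


def run_chaos_test(
    measure: Callable[[], float],
    inject_failure: Callable[[], None],
    restore: Callable[[], None],
    max_degradation: float,
    samples: int = 5,
) -> ChaosResult:
    # 1. Success criterion (assumed form): the metric may grow by at most
    #    a factor of max_degradation relative to steady state.
    # 2. Measure steady state while everything is healthy.
    baseline = statistics.mean([measure() for _ in range(samples)])
    # 3. Inject the failure.
    inject_failure()
    try:
        # 4. Observe the outcome while the failure is active.
        during = statistics.mean([measure() for _ in range(samples)])
    finally:
        # 5. Restore the system, even if observation raised an error.
        restore()
    return ChaosResult(baseline, during, during <= baseline * max_degradation)


class FakeService:
    """Hypothetical in-memory stand-in for a real system under test."""

    def __init__(self) -> None:
        self.degraded = False

    def latency_ms(self) -> float:
        return 300.0 if self.degraded else 100.0

    def break_dependency(self) -> None:
        self.degraded = True

    def heal(self) -> None:
        self.degraded = False


svc = FakeService()
result = run_chaos_test(
    svc.latency_ms, svc.break_dependency, svc.heal, max_degradation=2.0
)
print(result)
```

In this toy run the latency triples under failure, which exceeds the assumed 2x budget, so the test reports a failure. Putting the restore step in a `finally` block is a design choice of the sketch: it ensures the system is put back even when observation itself goes wrong.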
The speaker discussed a case study of a chaos experiment conducted on a real Goldman Sachs system. The experiment showed that monitoring should be decoupled from the service it monitors and that dashboards should be tested regularly against realistic, incident-like scenarios. The speaker also emphasized the importance of monitoring from the customer's perspective using synthetic probes.
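
A synthetic probe in this sense can be sketched as a small check that runs outside the service and exercises it the way a customer would. The sketch below is an assumption-laden illustration, not the setup described in the talk: `synthetic_probe`, `ProbeResult`, and the `request` callable are hypothetical, and the callable stands in for whatever real customer-facing request the probe would make.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str


def synthetic_probe(
    request: Callable[[], int], timeout_s: float = 2.0
) -> ProbeResult:
    """Run one customer-like request and judge it from the outside.

    `request` is a hypothetical callable that performs the request and
    returns an HTTP-style status code; it may raise on connection errors.
    """
    start = time.monotonic()
    try:
        status = request()
    except Exception as exc:
        # A failed request still yields a result, so the probe keeps
        # reporting even when the service itself is down.
        return ProbeResult(False, time.monotonic() - start, f"error: {exc}")
    latency = time.monotonic() - start
    ok = 200 <= status < 300 and latency <= timeout_s
    return ProbeResult(ok, latency, f"status {status}")


def healthy() -> int:
    return 200


def broken() -> int:
    raise ConnectionError("dependency unreachable")


print(synthetic_probe(healthy))
print(synthetic_probe(broken))
```

Because the probe depends only on the request it sends, it continues to produce a signal when the service's own metrics pipeline is affected by the same failure, which is the point of decoupling monitoring from the service.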