On the Hunt for Etcd Data Inconsistencies

Conference: KubeCon + CloudNativeCon Europe 2023

2023-04-19

Authors: Marek Siarkowicz

Summary

The presentation discusses model-based testing for verifying the correctness of distributed systems, using etcd as an example. The model starts out simple but can become complicated, and the tests are fragile in the presence of bugs or optimizations. The presentation also mentions the possibility of generalizing model-based testing beyond etcd.

Model-based testing is a generic approach to testing correctness, and it separates validation from execution
The model starts out simple but can become complicated, and the tests are fragile in the presence of bugs or optimizations
The state space grows exponentially, making the test fragile
The model can be generalized beyond etcd
The tests can validate either the operations or the model and generate a report
The presentation includes an anecdote about using fail points to test etcd and finding a durability issue
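
The separation of execution from validation, and the growth of the state space, can be illustrated with a minimal sketch. It assumes a heavily simplified key-value model with a single revision counter that starts at 0, and a history recorded by one client issuing requests one at a time. The names `Op`, `apply_put`, `step`, and `validate` are hypothetical and are not taken from the etcd test framework. A request with an unknown outcome (for example, a timeout) may or may not have been applied, so the validator has to keep both possibilities.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One recorded operation (hypothetical format, not etcd's)."""
    kind: str                       # "put" or "get"
    key: str
    value: Optional[str] = None     # value written (put) or value read (get)
    revision: Optional[int] = None  # revision seen by the client; None if the request failed


def apply_put(state, op):
    """Apply a put to a model state of the form (revision, frozenset of items)."""
    revision, kv = state
    data = dict(kv)
    data[op.key] = op.value
    return (revision + 1, frozenset(data.items()))


def step(states, op):
    """Return every model state that is consistent with the recorded operation."""
    next_states = set()
    for state in states:
        revision, kv = state
        if op.revision is None:
            # Unknown outcome: the request may have had no effect...
            next_states.add(state)
            if op.kind == "put":
                # ...or the put may have been applied anyway.
                next_states.add(apply_put(state, op))
        elif op.kind == "put":
            applied = apply_put(state, op)
            if applied[0] == op.revision:
                next_states.add(applied)
        elif op.kind == "get":
            if revision == op.revision and dict(kv).get(op.key) == op.value:
                next_states.add(state)
    return next_states


def validate(history):
    """Replay a recorded history against the model.

    Returns (index of the first inconsistent operation or None,
             largest number of candidate states tracked at once).
    """
    states = {(0, frozenset())}
    peak = len(states)
    for index, op in enumerate(history):
        states = step(states, op)
        if not states:
            return index, peak
        peak = max(peak, len(states))
    return None, peak
```

In this sketch the traffic generator only has to record `Op` values; `validate` runs afterwards on the recorded history. Each failed put can double the number of candidate states, which is one way to read the point about exponential growth.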

In the fail point anecdote, the test started an etcd server, checked its health, injected a fail point, and ran a strict validation. The test suite used a special library to tell etcd to crash at a specific point. The test simulated and validated traffic and reported all the information needed to determine whether the problem lay with etcd, the test suite, or the model. The visualization showed a durability issue: a put request was never persisted, and all subsequent requests returned a lower revision than the one the client had recorded.
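
The symptom described above, later responses carrying a lower revision than one the client already recorded, can be checked with a very small sketch. It assumes the client keeps an ordered list of `(request, revision)` pairs from its successful responses. The function name `find_revision_regressions` and the record format are hypothetical, not part of the etcd test suite.

```python
def find_revision_regressions(responses):
    """Flag responses whose revision is lower than one seen earlier.

    responses: list of (request_description, revision) tuples in the
    order the client received them (hypothetical record format).
    Returns a list of (index, request, revision, highest_seen_before).
    """
    regressions = []
    highest = None
    for index, (request, revision) in enumerate(responses):
        if highest is not None and revision < highest:
            # The store appears to have gone back in time, which suggests
            # an acknowledged write was not durable.
            regressions.append((index, request, revision, highest))
        else:
            highest = revision
    return regressions
```

For example, if a put is acknowledged at revision 6 and every read after the injected crash reports revision 5, each of those reads is flagged together with the revision 6 that the client had already seen.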

Abstract

Many things can go wrong in a distributed system, making conventional testing techniques ineffective in preventing serious and subtle bugs. Even for mature systems like etcd, built on the reliable Raft foundations, bugs are inevitable. Last year the etcd community discovered 4 critical issues including data inconsistencies and lost durability that managed to pass our tests and a rigorous code review. Unfortunately, the testing methodology used by the etcd project was insufficient to detect such problems. So to prevent such issues in the future we needed a new approach. Over the course of 6 months the etcd community built a new testing framework that retroactively detected all issues that were found manually and on top of that identified a new issue. This presentation will discuss how the etcd project has adopted model testing methodology to weed out data inconsistency bugs in etcd and prevent such issues in the future.
