The presentation discusses the challenges of distributed systems and how Kubernetes addresses them through its design choices. It also compares Kubernetes to other modern systems and explores real-world failure cases.
- Distributed systems are challenging because failure is inevitable, so they must be designed to handle it gracefully.
- Kubernetes is designed to handle failure through fault tolerance and resilient traffic routing.
- Other modern systems, such as Docker Swarm, HashiCorp Nomad, and K3s, take different approaches to handling failure.
- DistSys concepts such as the CAP theorem, Gossip protocols, High Availability, and the RAFT consensus algorithm are discussed (a small gossip sketch follows this list).
- Real-world cases, such as Target's 2019 cascading failure, are explored to illustrate the challenges of distributed systems.
- Understanding the problems confronting distributed systems and what 'correct' looks like is essential for designing and operating them effectively.
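To make one of these concepts concrete, here is a minimal gossip sketch in Go. It is illustrative only: the `Node` type, the heartbeat counters, and the synchronous rounds are hypothetical simplifications of what real gossip layers (such as the memberlist library underlying Serf and Consul) do with periodic UDP exchanges and failure detection.

```go
// A minimal gossip sketch: each node periodically picks a random peer and
// merges its view of cluster membership. All names here are illustrative,
// not any real library's API.
package main

import (
	"fmt"
	"math/rand"
)

// Node holds the state one member has learned so far: a map from member
// name to the highest "heartbeat" counter it has seen for that member.
type Node struct {
	name  string
	peers []*Node
	seen  map[string]int
}

// gossipOnce picks one random peer and exchanges state with it. Both sides
// keep the higher heartbeat counter for every member, so information spreads
// epidemically, reaching the whole cluster in O(log N) rounds in expectation.
func (n *Node) gossipOnce() {
	peer := n.peers[rand.Intn(len(n.peers))]
	for member, count := range n.seen {
		if count > peer.seen[member] {
			peer.seen[member] = count
		}
	}
	for member, count := range peer.seen {
		if count > n.seen[member] {
			n.seen[member] = count
		}
	}
}

func main() {
	// Build a 5-node cluster where every node starts knowing only itself.
	names := []string{"a", "b", "c", "d", "e"}
	nodes := make([]*Node, len(names))
	for i, name := range names {
		nodes[i] = &Node{name: name, seen: map[string]int{name: 1}}
	}
	for i, n := range nodes {
		for j, p := range nodes {
			if i != j {
				n.peers = append(n.peers, p)
			}
		}
	}

	// Run a few synchronous gossip rounds; a real system would run these
	// periodically over the network with failure detection layered on top.
	for round := 1; round <= 4; round++ {
		for _, n := range nodes {
			n.gossipOnce()
		}
		fmt.Printf("round %d: node a knows %d members\n", round, len(nodes[0].seen))
	}
}
```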
Target experienced a cascading failure in 2019 when an upgrade to their Kafka cluster caused intermittent network issues, triggering a thundering herd as Kubernetes workloads were rescheduled en masse. Over 41,000 nodes spun up in quick succession, flooding their service discovery layer before everything calmed down. The incident illustrates the risks of tightly coupled automated systems and the importance of designing for failure.
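One common defense against this thundering-herd pattern is exponential backoff with jitter, sketched below in Go. This is not how Target's systems work; the `registerWithDiscovery` function is hypothetical and stands in for any call that a herd of rescheduled workloads might make at once.

```go
// A minimal sketch of exponential backoff with "full jitter": retries are
// spread over a random window so thousands of clients don't retry in lockstep.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// registerWithDiscovery simulates an overloaded dependency that recovers
// after a few attempts; it is a stand-in, not a real API.
func registerWithDiscovery(attempt int) error {
	if attempt < 3 {
		return errors.New("discovery overloaded")
	}
	return nil
}

func main() {
	base := 100 * time.Millisecond
	max := 10 * time.Second

	for attempt := 0; ; attempt++ {
		if err := registerWithDiscovery(attempt); err == nil {
			fmt.Println("registered on attempt", attempt)
			return
		}
		// Exponential backoff, capped at max...
		backoff := base << attempt
		if backoff > max {
			backoff = max
		}
		// ...then sleep a uniform random fraction of the window ("full
		// jitter"), which decorrelates retries across the herd.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		fmt.Printf("attempt %d failed; sleeping %v\n", attempt, sleep)
		time.Sleep(sleep)
	}
}
```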
Why is Kubernetes designed the way it is? Distributed systems are hard. That's the short and unsatisfying answer. The long answer is that, like quantum mechanics, it's something that tends to make most of us uncomfortable. This talk is an introduction, and a way to build intuition about what makes a system like Kubernetes "correct." We'll do so by contrasting Kubernetes's design choices against other modern systems. We'll look at the implementation details of Docker Swarm, HashiCorp Nomad, K3s, and a "batteries included" Kubernetes distro like VMware Tanzu. In doing so, we'll discuss a number of distSys concepts. We'll learn about the CAP theorem, Gossip protocols, High Availability (HA), and the RAFT consensus algorithm. Finally, we'll look at real-world cases. Why do so many tools rely on etcd's RAFT? What caused Target's 2019 cascading failure? Attendees will walk away with a better idea of the problems confronting distSys, and an intuition of what "correct" looks like.
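As a taste of that last RAFT question, the sketch below models Raft's randomized election timeout in a single Go process. Everything here is a simplification for intuition: there are no real RPCs, terms, or logs, and the 150-300ms range is the illustrative timeout window suggested in the Raft paper, not etcd's actual configuration.

```go
// A minimal, single-process sketch of RAFT's randomized election timeout,
// the mechanism that makes leader election converge after a leader dies.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in the 150-300ms window. Because
// each follower draws a different timeout, one node usually times out first,
// requests votes, and wins before the others, making split votes rare.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	heartbeats := make(chan struct{})

	// Simulate a leader that heartbeats every 50ms a few times, then "crashes".
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(50 * time.Millisecond)
			heartbeats <- struct{}{}
		}
	}()

	// Follower loop: reset the election timer on every heartbeat; if the
	// timer fires first, stand for election.
	for {
		timer := time.NewTimer(electionTimeout())
		select {
		case <-heartbeats:
			fmt.Println("heartbeat received; staying a follower")
			timer.Stop()
		case <-timer.C:
			fmt.Println("election timeout elapsed; becoming a candidate and requesting votes")
			return
		}
	}
}
```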