Logs Told Us It Was DNS, It Felt Like DNS, It Had To Be DNS, It Wasn’t DNS

2022-05-20

Authors:   Laurent Bernaille, Elijah Andrews


Summary

The presentation covers a complex incident in Datadog's Kubernetes environment. The team initially suspected DNS problems during rolling updates, but extensive debugging traced the issue to the connection tracking table that the hypervisor maintains for AWS instances.
  • Tried different instance types and sizes to address the issue
  • Contacted AWS for more information on connection tracking limits
During the incident, Datadog noticed a spike in errors during rolling restarts of their metric service. Tracing the failing requests surfaced DNS errors, which pointed to a DNS problem. However, after instrumenting the instances to collect low-level metrics, they found a metric called 'conntrack exceeded' that correlated exactly with deployments. This led them to investigate the connection tracking table used by the hypervisor in AWS instances.
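
As a rough illustration of the kind of low-level instrumentation described above, the sketch below reads a per-interface counter from `ethtool -S` output. It assumes the metric in question corresponds to the `conntrack_allowance_exceeded` counter that recent AWS ENA drivers expose; the summary does not confirm the exact name. The function names and the `eth0` default are hypothetical, and this is not necessarily how Datadog collected the metric.

```python
import subprocess

# Assumption: counter name as reported by recent AWS ENA drivers in `ethtool -S`.
COUNTER = "conntrack_allowance_exceeded"


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def conntrack_drops(text):
    """Return the conntrack allowance counter, or 0 if the driver lacks it."""
    return parse_ethtool_stats(text).get(COUNTER, 0)


def read_interface(interface="eth0"):
    """Run ethtool on a live interface (hypothetical default name)."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return conntrack_drops(result.stdout)
```

Sampling this counter before and after a rolling restart and comparing the two values is one way to check whether packet drops line up with deployments.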

Abstract

It all started with a team reaching out because they had DNS issues during rolling updates. Business as usual when you host hundreds of applications on dozens of Kubernetes clusters… Four weeks later: We are reading kernel code to understand the corner cases of dropping Martian packets. Could this be the connection between gRPC client reconnect algorithms and the overflowing conntrack table we can feel but not see? In time, we solved the issue. And for once… it wasn't DNS! In this talk, we will focus on one of the most complex incidents we have faced in our Kubernetes environment. We will go through the debugging steps in detail, dive deep into the mysterious behaviors we discovered and explain how we finally addressed the incident by simply removing three lines of code.

Materials: