The presentation walks through a complex incident in Datadog's Kubernetes environment. Errors during rolling updates initially looked like a DNS problem, but extensive debugging showed that the issue was related to the connection tracking table that the hypervisor maintains for AWS instances.
- Datadog faced a complex incident in their Kubernetes environment
- Initially suspected DNS issues during rolling updates
- Extensive debugging revealed the issue was related to the connection tracking table used by the hypervisor in AWS instances
- Tried different instance types and sizes to address the issue
- Contacted AWS for more information on connection tracking limits
During the incident, Datadog noticed a spike in errors during rolling restarts of their metrics service. Tracing the failing requests turned up DNS errors, so they first suspected a DNS problem. After instrumenting the instances to collect low-level network metrics, however, they found a "conntrack exceeded" counter (most likely the ENA driver's `conntrack_allowance_exceeded`) that correlated exactly with deployments. This led them to investigate the connection tracking table used by the hypervisor in AWS instances.
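The talk summary does not say how the instances were instrumented. As a minimal sketch, assuming the counter is exposed through the ENA driver's `ethtool -S` statistics, the following Python reads those counters and computes how much one of them grew between two samples (for example, before and after a rolling restart). The interface name `eth0` and the helper names are illustrative assumptions, not Datadog's tooling.

```python
import re
import subprocess

# Allowance counters reported by the AWS ENA driver via `ethtool -S`.
# Assumption: the instance uses the ENA driver and a version that exposes them.
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
    "conntrack_allowance_exceeded",
    "linklocal_allowance_exceeded",
)

# Matches lines of the form "     counter_name: 123".
STAT_LINE = re.compile(r"^\s*([\w.\-\[\]]+):\s*(-?\d+)\s*$")


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of counter name -> integer value."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def allowance_counters(stats):
    """Keep only the allowance counters that are present in the parsed stats."""
    return {name: stats[name] for name in ALLOWANCE_COUNTERS if name in stats}


def counter_delta(before, after, name="conntrack_allowance_exceeded"):
    """Return how much a counter grew between two samples (missing counts as 0)."""
    return after.get(name, 0) - before.get(name, 0)


def read_interface_stats(interface="eth0"):
    """Run `ethtool -S` on a live host; `eth0` is an assumed interface name."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_stats(result.stdout)
```

Sampling `read_interface_stats()` before and after a deployment and passing both results to `counter_delta` gives the number of packets dropped in that window because the connection tracking allowance was exceeded. A nonzero delta that appears only around deployments would match the correlation described above.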