When the Logs Just Don’t Cut It: Root-Causing Incidents Without Re-Deploying Prod

Conference: KubeCon + CloudNativeCon North America 2022

2022-10-28

Authors: Phillip Kuznetsov

Summary

The presentation discusses how to root-cause incidents without redeploying production using bpftrace, a tool that captures useful data without restarting pods. The speaker demonstrates how to work with bpftrace on Kubernetes and shares tips and tricks for using Pixie to deploy and collect data from bpftrace scripts.

The speaker presents a scenario in which the front-end service of an e-commerce company is panicking and the root cause is unknown.
The service relies on a sum function to add money values, and the existing logs make it hard to tell which money values are invalid.
The speaker introduces bpftrace as a tool for capturing useful data without redeploying pods.
The speaker shares tips and tricks for working with bpftrace on Kubernetes, including using Pixie to deploy bpftrace scripts and collect their data.

The e-commerce scenario illustrates how hard it can be to identify root causes in microservices: the front-end service panics, the service depends on a sum function that adds money values, and it is difficult to identify which money values are invalid. The speaker then demonstrates how bpftrace can capture the missing data without redeploying pods, and how Pixie helps run bpftrace scripts on Kubernetes.
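To make the scenario concrete, here is a minimal Python sketch of how a sum function over money values can fail without revealing which input was bad. It is not the code from the talk: the `Money` fields, the validity rule, and the names `is_valid` and `sum_money` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

NANOS_PER_UNIT = 1_000_000_000  # assumed: fractional part stored in billionths


@dataclass(frozen=True)
class Money:
    """Hypothetical money value: whole units plus a fractional part in nanos."""
    currency_code: str
    units: int
    nanos: int


def is_valid(m: Money) -> bool:
    """Assumed rule: nanos must be in range and must not disagree in sign with units."""
    if abs(m.nanos) >= NANOS_PER_UNIT:
        return False
    if m.units > 0 and m.nanos < 0:
        return False
    if m.units < 0 and m.nanos > 0:
        return False
    return True


def sum_money(a: Money, b: Money) -> Money:
    """Add two money values, raising if either input is invalid."""
    if not is_valid(a) or not is_valid(b):
        # The message does not say which argument was invalid or what it held.
        # That is the gap that capturing function arguments at runtime can fill.
        raise ValueError("one of the specified money values is invalid")
    if a.currency_code != b.currency_code:
        raise ValueError("mismatching currency codes")
    total = (a.units + b.units) * NANOS_PER_UNIT + a.nanos + b.nanos
    sign = -1 if total < 0 else 1
    units, nanos = divmod(abs(total), NANOS_PER_UNIT)
    return Money(a.currency_code, sign * units, sign * nanos)
```

Calling `sum_money(Money("USD", 1, 500_000_000), Money("USD", 1, -5))` raises the `ValueError`, but the error alone does not show that the second argument carried a negative `nanos`. Seeing the actual arguments at the moment of the call would expose that detail without adding log statements and redeploying.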

Abstract

We’ve all been there: your pod is crash-looping, you check the logs, and you realize you forgot to log something important. Now you’re unable to figure out what went wrong. You try to reproduce the problem locally with no luck: it only seems to happen in production. What do you do? Do you re-deploy to production with more print statements? You could burn hours doing that while you risk more problems. What if you could instead get that same data without the headache of restarting prod? In this talk, I’ll show you how to magically collect this data using bpftrace. Bpftrace lets you capture lots of useful data (function arguments, return values, and latencies of individual functions, to name a few) without re-deploying pods. Bpftrace is very powerful, but can be complex to work with, especially in multi-node environments like a Kubernetes cluster. I’ll show you how to cut past these problems by walking through a demo incident. I’ll show you some tips and tricks for working with bpftrace on Kubernetes, including how to leverage Pixie to easily deploy and collect data from bpftrace scripts.
