Improving observability and reliability in a multi-cluster environment through infrastructure as code and custom metrics
- Investing in observability and reliability preemptively, before issues arise
- Using infrastructure as code, specifically Terraform and Argo CD, to manage multi-cluster deployments and keep them consistent (see the sketch after this list)
- Creating custom metrics, such as Kubeflow state metrics, to track specific product needs and enable effective SLOs and alerts
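The following is a minimal sketch of one way infrastructure as code can keep multi-cluster deployments consistent: render the same Argo CD Application manifest once per cluster from a single template, so every cluster tracks the same pinned revision. The cluster names, repository URL, overlay path, and pinned version below are illustrative assumptions, not details from the talk.

```python
# Illustrative sketch: generate one Argo CD Application per cluster from a shared
# template so the whole fleet runs the same Kubeflow manifests at the same revision.
import yaml  # pip install pyyaml

CLUSTERS = ["kubeflow-us-east", "kubeflow-eu-west"]  # hypothetical cluster names
KUBEFLOW_REVISION = "v1.8.0"                         # hypothetical pinned revision


def application_for(cluster: str) -> dict:
    """Build an Argo CD Application manifest targeting one cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"kubeflow-{cluster}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": "https://example.com/ml-platform/kubeflow-manifests.git",  # hypothetical
                "targetRevision": KUBEFLOW_REVISION,
                "path": "overlays/production",
            },
            "destination": {"name": cluster, "namespace": "kubeflow"},
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # Emit one manifest per cluster; identical sources keep the fleet consistent.
    print(yaml.safe_dump_all([application_for(c) for c in CLUSTERS], sort_keys=False))
```

In a setup like this, Terraform would typically provision the clusters themselves while Argo CD continuously syncs the rendered Applications onto them.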
The team found that many of the out-of-the-box metrics provided by dependencies such as Istio and Kubernetes were too general to track what was critical to their product offering. They implemented Kubeflow state metrics, custom metrics targeted at their specific use cases, such as how long Kubeflow pipeline pods take to start running and how long they remain on the cluster after execution concludes. These metrics were then used downstream in SLOs, alerts, and dashboards.
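The sketch below shows what a custom exporter of this kind could look like, built with the Kubernetes Python client and prometheus_client. The metric names, namespace, and label selector are assumptions for illustration; the talk does not publish Spotify's actual Kubeflow state metrics implementation.

```python
# Minimal sketch of a kube-state-metrics-style exporter for Kubeflow pipeline pods.
# Assumed: metric names, the "kubeflow" namespace, and the Argo Workflows label selector.
from datetime import datetime, timezone

from kubernetes import client, config, watch
from prometheus_client import Histogram, start_http_server

# Seconds a pipeline pod waits between creation and entering the Running phase.
POD_START_LATENCY = Histogram(
    "kubeflow_pipeline_pod_start_latency_seconds",
    "Seconds from pod creation until the pod starts running",
    buckets=(5, 15, 30, 60, 120, 300, 600),
)
# Seconds a finished pipeline pod stays on the cluster before it is deleted.
POD_LINGER_SECONDS = Histogram(
    "kubeflow_pipeline_pod_linger_seconds",
    "Seconds a finished pipeline pod remains on the cluster before deletion",
    buckets=(60, 300, 900, 3600, 14400, 86400),
)


def main() -> None:
    config.load_incluster_config()  # use config.load_kube_config() when running locally
    v1 = client.CoreV1Api()
    start_http_server(9090)  # expose /metrics for Prometheus to scrape

    seen_running: set[str] = set()
    finished_at: dict[str, datetime] = {}

    for event in watch.Watch().stream(
        v1.list_namespaced_pod,
        namespace="kubeflow",                             # assumed namespace
        label_selector="workflows.argoproj.io/workflow",  # assumed selector for pipeline pods
    ):
        pod = event["object"]
        uid, phase = pod.metadata.uid, pod.status.phase
        now = datetime.now(timezone.utc)

        # First transition into Running: record start-up latency.
        if phase == "Running" and uid not in seen_running and pod.status.start_time:
            POD_START_LATENCY.observe(
                (pod.status.start_time - pod.metadata.creation_timestamp).total_seconds()
            )
            seen_running.add(uid)

        # Remember when execution concluded ...
        if phase in ("Succeeded", "Failed") and uid not in finished_at:
            finished_at[uid] = now

        # ... and record how long the pod lingered until it was actually deleted.
        if event["type"] == "DELETED" and uid in finished_at:
            POD_LINGER_SECONDS.observe((now - finished_at.pop(uid)).total_seconds())
            seen_running.discard(uid)


if __name__ == "__main__":
    main()
```

Histograms like these can then back SLOs and alerts directly, for example on the 95th percentile of pod start-up latency.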
Spotify began offering a centralized Kubeflow Pipelines product to its machine learning teams around two years ago. Since then, adoption has skyrocketed, with more teams training more models and running increasingly complex experiments. These increased demands on our system come with more stringent demands on us, the Kubeflow team at Spotify, to ensure not just cluster reliability, but cluster equitability. Our job is not just to be cluster maintainers, but cluster stewards—ensuring equitable and reliable access to cluster resources, and keeping users from stepping on each other’s toes. In this talk, we’ll discuss our streamlined tooling to maintain, deploy, and monitor Spotify’s distribution of Kubeflow. We’ll illustrate the challenges we face as we scale to increased user load and increasingly distinct and demanding pipelines, and outline our approach to addressing those challenges with “multi-cluster” Kubeflow. Finally, we’ll give a preview of our future plans for the platform.