Experience with “Hard Multi-Tenancy” in Kubernetes Using Kata Containers

2023-04-19

Authors: Shuo Chen


Summary

Databricks uses Kata Containers for hard multi-tenancy in Kubernetes clusters to provide strong isolation for performance-sensitive workloads such as Data Lakehouse. The case study discusses the challenges faced, trade-offs among security, performance, and cost, and how to work around the heterogeneity across different public cloud providers.
  • Databricks is building a serverless platform for performance-sensitive workloads such as Data Lakehouse on Kubernetes clusters
  • They need hard multi-tenant container isolation since each cluster runs code on behalf of multiple customers
  • They chose Kata Containers, an open-source container runtime that provides strong isolation by running containers in micro-VMs
  • They built a hard compute and network isolation layer among untrusted workloads in Kubernetes clusters using Kata Containers, network policies, and network security groups
  • They share their first-hand experience on how they integrate Kata Containers with Kubernetes in production, highlighting the challenges they faced, difficult trade-offs among security, performance, and cost, and how to work around the heterogeneity across different public cloud providers
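As a rough illustration of the building blocks named in the list above, the following Python sketch generates three Kubernetes manifests: a RuntimeClass that routes pods to a Kata runtime, a tenant pod that opts into it through `runtimeClassName`, and a default-deny NetworkPolicy for the tenant namespace. The API groups and fields are standard Kubernetes; the handler name `kata`, the namespace-per-tenant layout, the pod name, and the image are assumptions made for the example, not details from the talk, and cloud-level network security groups are out of scope here.

```python
import json


def runtime_class(name="kata", handler="kata"):
    """RuntimeClass mapping a name to a CRI handler.

    The handler must match a runtime configured on the nodes;
    "kata" is an assumed name, not one taken from the talk.
    """
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def tenant_pod(tenant, image, runtime_class_name="kata"):
    """Pod that asks to run under the given RuntimeClass (hypothetical layout)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{tenant}-workload", "namespace": tenant},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "main", "image": image}],
        },
    }


def default_deny(namespace):
    """NetworkPolicy selecting every pod in the namespace and allowing no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


if __name__ == "__main__":
    items = [
        runtime_class(),
        tenant_pod("tenant-a", "example.invalid/workload:latest"),
        default_deny("tenant-a"),
    ]
    # kubectl accepts JSON as well as YAML, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

A real deployment would add explicit allow rules on top of the default-deny policy; the sketch only shows the starting posture of "nothing is reachable unless permitted".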
Databricks identified risks in a traditional Kubernetes environment, including resource contention and additional infrastructure costs. The container boundary alone might not be sufficient because customers can run arbitrary code, and workloads sharing the bandwidth of the same NIC might see performance variation. They also had to use large machines and place multiple Kata VMs on each one, which caused fragmentation and higher infrastructure costs. Finally, they had to tune Kata Containers to reach performance close to that of native containers, and to keep the infrastructure consistent, performant, and cost-efficient.
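The fragmentation cost of packing several fixed-size VMs onto one large machine can be made concrete with a small calculation. The sketch below uses a simple first-fit placement and hypothetical sizes (64-vCPU hosts, 24-vCPU VMs); neither the numbers nor the placement strategy come from the talk, and a real scheduler also weighs memory and other resources.

```python
def first_fit(vm_sizes, host_capacity):
    """Place VMs on hosts with first-fit; return the free capacity left on each host."""
    free = []  # free[i] is the unused capacity on host i
    for size in vm_sizes:
        if size > host_capacity:
            raise ValueError(f"VM of size {size} cannot fit on any host")
        for i, remaining in enumerate(free):
            if remaining >= size:
                free[i] = remaining - size
                break
        else:
            # No existing host has room, so a new one is opened.
            free.append(host_capacity - size)
    return free


def stranded_fraction(vm_sizes, host_capacity):
    """Share of provisioned capacity that no VM is using."""
    free = first_fit(vm_sizes, host_capacity)
    if not free:
        return 0.0
    return sum(free) / (len(free) * host_capacity)


if __name__ == "__main__":
    vms = [24, 24, 24, 24]  # hypothetical vCPU sizes
    print(first_fit(vms, 64))          # [16, 16]
    print(stranded_fraction(vms, 64))  # 0.25
```

With these assumed sizes, two VMs fit per host and 16 vCPUs on each host are left over, too small for a third VM, so a quarter of the paid-for capacity is stranded.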

Abstract

Databricks is building a serverless platform for performance-sensitive workloads such as Data Lakehouse on Kubernetes clusters. Because each cluster runs code on behalf of multiple customers, we need “hard multi-tenant” container isolation. After considering various options, we chose Kata Containers, an open-source container runtime that provides strong isolation by running containers in micro-VMs. This case study discusses how we built a hard compute and network isolation layer among untrusted workloads in Kubernetes clusters using Kata Containers, network policies, and network security groups. We share our first-hand experience integrating Kata Containers with Kubernetes in production, highlighting the challenges we faced, the difficult trade-offs among security, performance, and cost, and how we worked around the heterogeneity across different public cloud providers.

Materials: