Spark on Kubernetes: The Elastic Story


Authors:   Bowen Li, huichao zhao


The presentation discusses the design principles and architecture of a cloud-native Spark on Kubernetes platform, highlighting the benefits of cloud and Kubernetes and the need for auto-scaling based on cost-saving and elasticity.
  • Cloud and Kubernetes can solve problems of legacy infrastructure by providing on-demand, elastic, and scalable resources with strong resource isolation and cutting-edge security techniques.
  • Design principles include fully embracing public cloud and cognitive way of thinking, containerization for elasticity and reproducibility, and decoupling compute and storage for independent scaling.
  • The architecture of the cloud-native Spark on Kubernetes platform involves multiple Spark Kubernetes clusters, a Spark service gateway, and a multi-tenant platform with advanced features such as physical isolation and min/max capacity setting.
  • Auto-scaling is necessary for cost-saving and elasticity, and the presentation discusses the design of reactive auto-scaling and its productionization.
  • The platform has been running in production for a year, supporting many business-critical workloads for Apple AML.
The presenter mentions that in the cloud-native world, upgrading infrastructure no longer requires risky in-place upgrades. Instead, a completely new environment can be spun up and traffic gradually rolled over, with the ability to switch back instantaneously if there are any issues. This flexibility is a huge win for DevOps.


Apache Spark is a unified analytics engine for large-scale data processing. People are moving Spark and batch workload to Kubernetes due to its uprising popularity. There are many challenges to running Spark efficiently on Kubernetes, for example, supporting autoscaling-based workloads. In this talk, we discuss building a large scale Spark Service on top of Kubernetes. We will also walk through autoscaling on a multi-tenant platform with advanced features such as physical isolation, min/max capacity setting, bin-packing, scale-in and scale out controls, and more. These improvements show significant CPU and memory utilization savings for Spark on Kubernetes.Click here to view captioning/translation in the MeetingPlay platform!