Scaling Apache Spark on Kube to Apple Scale

Conference: KubeCon + CloudNativeCon Europe 2021

Authors: Holden Karau, Amanda Moran

Summary

The talk discusses scaling machine learning with Apache Spark on Kubernetes, including considerations and best practices for end users and advice for those migrating from YARN with HDFS to Kubernetes. It also covers how to effectively deploy the new enhancements of Spark on Kube, like shuffle tracking and graceful decommissioning, as well as when not to use them.

Introduction of speakers and their backgrounds
Recap of Spark architecture
Confusion around when to use Spark for machine learning
Spark is a powerful tool for machine learning
Considerations and best practices for end users of Spark on Kubernetes
Advice for those migrating from YARN with HDFS to Kubernetes
Effective deployment of new enhancements of Spark on Kube
When not to use Spark for machine learning

The speakers open with a confession: the presentation is a recording of their past selves. They encourage the audience to ask questions in the chat and say they will be available there to answer them.

Abstract

Amanda and Holden will explore which customer workloads ported easily to Apache Spark on Kubernetes and which ones had more difficulty. The goal of this talk is to help audience members in their journey, whether as operators of an Apache Spark on Kubernetes platform or as end users. Considerations and best practices for end users of an Apache Spark on Kubernetes platform will be discussed. Additional advice for folks migrating from YARN with HDFS to Kubernetes will be included. This talk will include how to effectively deploy the new enhancements of Spark on Kube, like shuffle tracking and graceful decommissioning, as well as when not to use them.
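
For readers unfamiliar with the two enhancements named in the abstract, the following is a minimal sketch of how they are typically switched on through Spark configuration. The property names are the documented Spark 3.1+ settings for dynamic allocation with shuffle tracking and for executor decommissioning; the helper functions and the executor counts are hypothetical and are not taken from the talk.

```python
# Hypothetical helpers (not from the talk): collect the Spark 3.1+ settings
# for shuffle tracking and graceful decommissioning, then render them as
# spark-submit "--conf key=value" arguments.

def elastic_spark_conf(min_executors=1, max_executors=10):
    """Return Spark properties as a dict; executor bounds are illustrative."""
    return {
        # Dynamic allocation without an external shuffle service relies on
        # shuffle tracking to know which executors still hold shuffle data.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Graceful decommissioning: migrate shuffle and cached RDD blocks
        # off an executor before it goes away.
        "spark.decommission.enabled": "true",
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        "spark.storage.decommission.rddBlocks.enabled": "true",
    }


def to_submit_args(conf):
    """Flatten a properties dict into a list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(elastic_spark_conf(2, 50))))
```

The resulting arguments can be appended to a `spark-submit` invocation that targets a Kubernetes master. Whether these settings suit a given workload depends on the job, which is the question the talk itself addresses.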

Materials:

Slides
