Authors: William Wang
2023-04-21

tldr - powered by Generative AI

Volcano is a cloud-native batch system that provides a unified job scheduling and management solution for Kubernetes clusters. It is designed to be scalable, flexible, and extensible, and it supports a wide range of workloads, including machine learning, data processing, and scientific computing.
  • Volcano has several features that make it a powerful tool for managing batch workloads, including job scheduling, resource management, and job dependencies.
  • Volcano is used by a diverse group of users, including those in the AI and data areas, and it has a large and active community of contributors.
  • Volcano integrates with a variety of other tools and platforms, including Spark, Argo, and Airflow.
  • Volcano provides documentation and support for a wide range of training operators, including TensorFlow, MXNet, and MPI.
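Volcano's job management and gang scheduling can be sketched with its Job CRD. The names, images, and replica counts below are illustrative, not taken from the talk; the API group and fields are Volcano's:

```yaml
# Illustrative Volcano Job: a two-role batch workload handed to the
# Volcano scheduler, gang-scheduled (all pods start together or not at all).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-example            # hypothetical job name
spec:
  schedulerName: volcano      # let the Volcano scheduler place the pods
  minAvailable: 3             # gang scheduling: run only when 3 pods fit
  queue: default
  tasks:
    - name: ps
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ps
              image: tensorflow/tensorflow:latest   # illustrative image
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: tensorflow/tensorflow:latest
```

The `minAvailable` field is what distinguishes this from plain Kubernetes Jobs: distributed training workloads deadlock if only some of their pods are scheduled, so the scheduler admits the gang as a unit.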
Authors: Xing Yang, Melissa Logan, Alvaro Hernandez, Sergey Pronin
2023-04-20

To handle Day-2 operations for data workloads on Kubernetes, organizations rely heavily on operators, but they present a number of challenges – including lack of integration with existing tools; lack of interoperability with the rest of their stack; varying degrees of quality; and lack of standardization. And yet – a majority of people are using at least 20 operators according to the 2022 Data on Kubernetes Report. For those evaluating their options, the challenge is further complicated by choice; the number of operators continues to grow with Operator Hub currently listing 270+. Without operator standards, how can end users possibly evaluate each one to know whether it meets their needs? This panel unites the Data on Kubernetes Community Operator SIG and Kubernetes Storage SIG to discuss key features of Kubernetes database operators -- what works, what doesn’t, and where the industry is going. Panelists will also present a feature matrix to help end users compare a multitude of database operators.
Authors: Chao Chen, Geeta Gharpure
2023-04-19

tldr - powered by Generative AI

Operational issues and their mitigations in running etcd
  • Database size exceeding the configured quota
  • Revision divergence between cluster members
  • Out-of-memory panics
  • Request timeouts caused by defragmentation
  • Oversized requests
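Several of these mitigations map to etcd configuration. As a hedged sketch (the values are examples, not recommendations from the talk), an etcd config file can bound the backend size, keep revision history compacted, and reject oversized requests:

```yaml
# Illustrative etcd configuration fragment (values are examples only).
quota-backend-bytes: 8589934592   # 8 GiB backend quota before the NOSPACE alarm
auto-compaction-mode: periodic    # compact old revisions on a schedule
auto-compaction-retention: "1h"   # keep roughly one hour of revision history
max-request-bytes: 1572864        # reject requests larger than ~1.5 MiB
```

Note that compaction only reclaims logical space; returning disk space to the filesystem still requires an explicit defragmentation, which is exactly when the defrag-related timeouts mentioned above can surface.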
Authors: Ricardo Aravena, Nikhita Raghunath
2023-04-19

tldr - powered by Generative AI

The presentation discusses the TAG Runtime and its working groups, as well as updates on various projects within the CNCF ecosystem.
  • TAG Runtime is a community of experts in AI, cybersecurity, and DevOps who work on projects within the CNCF ecosystem
  • The presentation highlights updates on various projects, including Flatcar, Keda, and IoT Edge
  • The TAG Runtime working groups include IoT Edge, WebAssembly, Kubernetes Tooling, and Llam
  • The group meets every first and third Thursday of the month and is considering expanding to a weekly cadence
  • The presentation also emphasizes the need for more contributions and involvement from interested parties
Authors: Abubakar Siddiq Ango
2022-10-27

tldr - powered by Generative AI

Choosing the right container runtime engine is crucial for different use cases. Docker is a good option for developers, but there are other options like Podman, gVisor, Kata Containers, and Firecracker for more secure and isolated environments. Kubernetes can work with different OCI-compliant runtimes.
  • Traditional deployment of applications can be unreliable
  • Virtualization creates isolated environments but can be limiting
  • Containers allow for deploying applications with all dependencies while still having access to host resources
  • Docker is a good option for developers but has restrictions
  • Podman can be a drop-in replacement for Docker and is more secure
  • gVisor, Kata Containers, and Firecracker are options for more isolated environments
  • Kubernetes can work with different OCI-compliant runtimes
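The last point is expressed in Kubernetes through the RuntimeClass API. A sketch, assuming the node's CRI is already configured with a matching handler (the handler name `runsc` is gVisor's OCI runtime; the pod name and image are illustrative):

```yaml
# RuntimeClass mapping a cluster-visible name to a CRI runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc              # gVisor's OCI runtime, configured on the node
---
# A Pod opting into the sandboxed runtime.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app       # hypothetical
spec:
  runtimeClassName: gvisor  # run this pod's containers under gVisor
  containers:
    - name: app
      image: nginx          # illustrative image
```

Pods that omit `runtimeClassName` keep using the node's default runtime, so sandboxed and conventional workloads can share a cluster.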
Authors: Patrick Ohly, Alexander Kanevskiy, Kate Goldenring
2022-10-27

Kubernetes is powerfully declarative, with YAML as the UX for requesting everything a workload needs. Kubernetes has tried to maintain this defining characteristic even as scenarios continue to expand. The device plugin interface was introduced to Kubernetes back in v1.10 to enable requesting and reserving static hardware for workloads, such as GPUs for ML applications. What about other devices used by workloads? This talk will cover several stories of how different types of devices can be used in Kubernetes clusters:
  • From traditional datacenters to small IoT-centric devices.
  • From exclusively accessed to shared devices.
  • From local stateless devices to network-attached devices.
  • From simple single-purpose devices to pipelines of devices.
All these scenarios require a simple yet flexible UX for users to request a variety of devices with various properties. Alexander and Kate will discuss projects and proposals in the Kubernetes ecosystem that are working towards this goal of connecting devices and workloads. They will also discuss how to get involved in this evolution to let workloads be utterly materialistic. Whatever the app needs, it shall get.
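The device plugin model referenced above surfaces hardware as extended resources that a Pod requests declaratively. A sketch (the resource name `nvidia.com/gpu` is the one NVIDIA's device plugin registers; pod name and image are illustrative):

```yaml
# A Pod reserving one GPU through the device plugin interface.
apiVersion: v1
kind: Pod
metadata:
  name: ml-training         # hypothetical
spec:
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1   # exclusive reservation of one device
```

Extended resources like this are always requested as integers and cannot be overcommitted, which is exactly the "static hardware" limitation the talk's broader device scenarios push against.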
Authors: Carlos Sanchez
2022-05-20

tldr - powered by Generative AI

Optimizing resource usage in Kubernetes clusters through hibernation and workload distribution
  • Built hibernation and workload distribution systems to optimize resource usage
  • Applied at both application and infrastructure levels
  • Recommendations for setting CPU and memory requests and limits
  • Use of standard VMs with CPU to memory ratio based on application usage
  • Explicitly setting JVM heap size to avoid surprises
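The recommendations on requests, limits, and explicit JVM heap sizing might look like the following in practice. This is a sketch with illustrative values, not the speaker's exact configuration:

```yaml
# Illustrative JVM workload with explicit resource sizing.
apiVersion: v1
kind: Pod
metadata:
  name: jvm-service           # hypothetical
spec:
  containers:
    - name: app
      image: eclipse-temurin:17-jre   # illustrative image
      env:
        - name: JAVA_OPTS
          value: "-Xms1g -Xmx1g"      # pin the heap; avoid default-heap surprises
      resources:
        requests:
          cpu: "500m"         # what the scheduler reserves on the node
          memory: 1536Mi      # heap plus JVM overhead (metaspace, threads)
        limits:
          memory: 1536Mi      # OOM-kill boundary; kept equal to the request
```

Setting the memory limit equal to the request and sizing it above the fixed heap is a common pattern to keep the JVM from being OOM-killed by off-heap allocations.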
Authors: Ricardo Rocha
2022-05-19

tldr - powered by Generative AI

The presentation discusses the challenges of implementing cloud native and high performance computing (HPC) and how recent work is bridging the gap between the two.
  • Cloud native and Kubernetes have become popular in modern IT deployments, but challenges remain in areas where HPC can have a larger impact.
  • HPC involves aggregating computing power to deliver higher performance for solving large problems in science, engineering, and business.
  • HPC deployments require low latency, high throughput, and NUMA awareness, which are not common in most deployments.
  • Advanced scheduling is also important for HPC deployments with millions of jobs and users with different software needs.
  • The speaker shares an anecdote about CERN's experience with transitioning to Kubernetes for their HPC needs.
  • High throughput computing is a similar paradigm to HPC, but focuses on the efficient execution of a large number of loosely coupled tasks.
  • The speaker highlights the similarities between high throughput computing and cloud native systems.
Authors: Klaus Ma
2022-05-18

tldr - powered by Generative AI

Volcano is a cloud-native batch system for intelligent workloads such as HPC, AI, and big data. It addresses gaps in the cloud-native ecosystem for batch workloads, including job management, scheduling, support for different workloads, dynamic resource scheduling, and performance.
  • Volcano includes several components for multi-cloud scenarios, such as the Job and Queue APIs and their controllers
  • The Volcano scheduler is built on kube-batch, but introduces more scheduling algorithms and performance enhancements
  • Volcano was open-sourced in 2019, donated to the CNCF in 2020, and is now an incubating project
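The queue abstraction mentioned above can be sketched as a Volcano Queue object, which carves cluster capacity into shares for different tenants. The name and values here are illustrative:

```yaml
# Illustrative Volcano Queue: a tenant's share of the cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research            # hypothetical queue name
spec:
  weight: 2                 # proportional share relative to other queues
  capability:
    cpu: "64"               # hard cap on what this queue may consume
    memory: 256Gi
```

Jobs then reference the queue by name (as in the `queue:` field of a Volcano Job), and the scheduler arbitrates between queues by weight when the cluster is contended.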
Authors: Peter Hunt, Antti Kervinen
2021-10-15

tldr - powered by Generative AI

The presentation discusses the implementation of QoS (Quality of Service) in Kubernetes using block I/O classes and CPU manager policies to prioritize critical workloads.
  • QoS can be implemented in Kubernetes using block I/O classes and CPU manager policies
  • Block I/O classes can be used to prioritize workloads based on their importance
  • CPU manager policies can be used to assign specific CPU affinity to critical processes
  • Throttling can be used to limit resource contention and prioritize critical workloads
  • An anecdote is provided to illustrate the importance of prioritizing critical workloads
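The CPU manager policy mentioned above is a kubelet setting; combined with a Guaranteed-QoS pod that requests whole CPUs, it pins critical workloads to exclusive cores. A sketch with illustrative values:

```yaml
# KubeletConfiguration fragment enabling the static CPU manager policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: "0,1"    # illustrative: keep cores 0-1 for system daemons
---
# Guaranteed-QoS pod: requests == limits and integer CPUs,
# so the static policy grants it exclusive cores.
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical     # hypothetical
spec:
  containers:
    - name: app
      image: nginx           # illustrative image
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
```

Pods that request fractional CPUs or fall outside the Guaranteed class keep sharing the remaining cores, which is how lower-priority workloads are kept off the pinned CPUs.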