Authors: Rodrigo Serra Inacio, Willian Saavedra Moreira Costa
2023-04-21

tldr - powered by Generative AI

Cloud Metrics is a scalable and resilient platform for monitoring both the systems and the environments of a bank. The key to building the platform was isolating tenants and reducing noise between them. The main components used were Kubernetes, Prometheus, Grafana, and Alertmanager. The infrastructure was built on EKS and hosted in São Paulo, Brazil. Users access their metrics through Grafana and Prometheus images. Each tenant has its own account and bucket to store its metrics.
  • Cloud Metrics is a platform for monitoring both systems and environments of a bank
  • Isolation and reducing noise between tenants was key to building the platform
  • Main components used were Kubernetes, Prometheus, Grafana, and Alertmanager
  • Infrastructure was built on EKS and hosted in São Paulo, Brazil
  • Users access their metrics through Grafana and Prometheus images
  • Each tenant has its own account and bucket to store its metrics (see the sketch after this list)
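
As a rough illustration of the per-tenant isolation described above, here is a minimal Python sketch (not from the talk) in which each tenant's metric blocks land in that tenant's own S3 bucket; the bucket naming convention, region, and paths are assumptions.

import boto3

REGION = "sa-east-1"  # the Sao Paulo region, matching the hosting choice above

def tenant_bucket(tenant: str) -> str:
    # Hypothetical naming convention: one dedicated bucket per tenant.
    return f"cloud-metrics-{tenant}"

def upload_block(tenant: str, local_path: str, key: str) -> None:
    s3 = boto3.client("s3", region_name=REGION)
    # Each tenant writes only to its own bucket; in practice, IAM policies on
    # the tenant's dedicated account would enforce this boundary.
    s3.upload_file(local_path, tenant_bucket(tenant), key)

upload_block("payments", "/data/blocks/chunk-000001", "blocks/chunk-000001")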
Authors: Ricardo Rocha, Spyros Trigazis
2023-04-20

The Kubernetes infrastructure at CERN runs a variety of workloads, from scientific computing to critical services for the campus and our physics accelerator complex. It’s important to offer the features and capabilities our users require, but even more important to deliver the high levels of service they depend on. In this session we present in detail a recent incident in which a rogue maintenance tool deleted a third of our production capacity in minutes, how this resulted in no downtime and only service degradation, and how we were able to recover in a short time. We describe our architecture for high service availability, the options we took to reduce blast radius, the concept of “clusters as cattle”, and how extensive use of GitOps saved the day. We also describe some lessons learned in the process, the cyclic dependencies detected when recovering from a major outage, and the corner cases where more care is needed for stateful workloads and multi-cluster scheduling. We will demo this on stage, showing how real CERN services recover from what not so long ago would have been events with a very serious impact, and how the effort of the last few years has paid off, with our users responding calmly and positively while going through a major incident.
Authors: Karen Jex
2023-04-20

tldr - powered by Generative AI

The presentation discusses the deployment of mission-critical PostgreSQL databases on Kubernetes, exploring the benefits, use cases, and implementation of robust, secure, scalable, and easily manageable database architectures on the platform.
  • Evolution of database architecture from bare metal to virtualization to containerization
  • Introduction to containers, container orchestration, and Kubernetes
  • Benefits of deploying databases on Kubernetes, including flexibility, scalability, and automation of DBA tasks
  • Demonstration of deploying a PostgreSQL cluster on Kubernetes
  • Anecdote about generating data in a database using pgbench (a small data-generation sketch follows this list)
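
Below is a hedged sketch of that kind of data generation, done from Python rather than pgbench itself; the Service hostname, database name, and credentials are assumptions, not values from the talk.

import psycopg2
from psycopg2.extras import execute_values

# Connect through the cluster's (hypothetical) read-write Service.
conn = psycopg2.connect(host="pg-cluster-rw.databases.svc",
                        dbname="app", user="app", password="change-me")
with conn, conn.cursor() as cur:
    # A pgbench-style accounts table, initialised with throwaway rows.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS bench_accounts (
            aid      integer PRIMARY KEY,
            abalance integer NOT NULL DEFAULT 0
        )
    """)
    execute_values(cur,
                   "INSERT INTO bench_accounts (aid, abalance) VALUES %s "
                   "ON CONFLICT (aid) DO NOTHING",
                   [(i, 0) for i in range(1, 100_001)],
                   page_size=1000)
    cur.execute("SELECT count(*) FROM bench_accounts")
    print("rows loaded:", cur.fetchone()[0])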
Authors: Leila Vayghan
2023-04-19

This talk is the story of how Shopify runs a highly available and scalable stateful application on Kubernetes that is accessed securely over the internet. The application discussed is Elasticsearch, which stores petabytes of data across the globe. Search is a fundamental component of an ecommerce platform, and high availability is an important requirement for it. While Kubernetes has proven to be the perfect platform for deploying stateless applications, running stateful applications on it in a highly available and scalable manner can be complicated. This talk will discuss these challenges and share the steps taken towards solving them. For example, Leila will explain the obstacles to implementing storage autoscaling and how using existing Kubernetes features allowed seamless expansion of the persistent disks that store critical search data. She will also explain how her team implemented a feature that allowed shrinking persistent disks without any data loss, saving costs by releasing unused storage. Leila will also explain how Envoy is used to allow clients to connect to Elasticsearch through Kubernetes ingress. This talk will give insight into the challenges and rewards of running highly available and scalable stateful applications on Kubernetes.
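
Storage autoscaling of the kind Leila describes ultimately comes down to growing a PersistentVolumeClaim in place. The sketch below is not Shopify's code; it assumes a hypothetical PVC name, namespace, and a StorageClass with allowVolumeExpansion enabled, and shows the shape of that operation with the Kubernetes Python client.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core = client.CoreV1Api()

# Request a larger disk for one Elasticsearch data pod; Kubernetes and the
# CSI driver then grow the underlying volume, online with most modern drivers.
core.patch_namespaced_persistent_volume_claim(
    name="data-elasticsearch-data-0",   # hypothetical PVC name
    namespace="search",                 # hypothetical namespace
    body={"spec": {"resources": {"requests": {"storage": "2Ti"}}}},
)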
Authors: Aldo Culquicondor, Kante Yin
2023-04-19

tldr - powered by Generative AI

The talk discusses the latest enhancements from Kubernetes SIG Scheduling and opportunities for better support for service and batch workloads.
  • Improvements in scheduler performance for higher scheduling throughput
  • Better support for rolling updates in deployments while maintaining high availability
  • Introduction of the SchedulingGates knob for external integrators to control when a pod becomes schedulable (see the sketch after this list)
  • Development of sponsored projects such as Kueue, scheduling plugins, and the descheduler
  • Discussion on priority and pod scheduling policies
  • Importance of paying attention to machine availability and idle pods
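
To make the SchedulingGates idea concrete, here is a hedged sketch, assuming a cluster and Python client recent enough to expose the field; the pod name, namespace, and gate identifier are illustrative, not from the talk. A pod is created gated, and an external integrator later lifts the gate so the scheduler will consider it.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# 1. Create the pod with a scheduling gate: the scheduler will not bind it
#    while any gate is present.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gated-demo"),
    spec=client.V1PodSpec(
        scheduling_gates=[client.V1PodSchedulingGate(name="example.com/quota-check")],
        containers=[client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)

# 2. Later, the external integrator (e.g. a quota controller) lifts the gate;
#    gates can only be removed after creation, never added.
live = core.read_namespaced_pod(name="gated-demo", namespace="default")
live.spec.scheduling_gates = []
core.replace_namespaced_pod(name="gated-demo", namespace="default", body=live)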
Authors: Deepthi Sigireddi, Manan Gupta
2022-10-27

tldr - powered by Generative AI

The presentation discusses the engineering approach taken by Vitess to solve the consensus problem in a high QPS environment while prioritizing performance over theoretical correctness.
  • Vitess is a single leader system that relies on a topology server for persistent state and recovery
  • The system prioritizes performance over theoretical correctness
  • Durability policy is defined as avoiding data loss and is configurable based on trade-offs between durability and availability
  • Leader election has three stages: revocation, choosing a new leader, and propagation (a toy sketch of these stages follows this list)
  • Planned and unplanned leader elections have different revocation processes
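
The three stages map naturally onto a toy model. The sketch below is illustrative only and is not Vitess source; the Replica class and the integer "replication position" are stand-ins for real tablets and GTID sets.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    position: int            # stand-in for a real replication position (GTID set)
    read_only: bool = True
    source: str = ""

def reparent(old_primary: Replica, replicas: list[Replica], topo: dict) -> Replica:
    # 1. Revocation: make sure the old primary can no longer accept writes.
    old_primary.read_only = True

    # 2. Choosing a new leader: the most caught-up replica (a real system would
    #    also honour the configured durability policy here).
    candidate = max(replicas, key=lambda r: r.position)

    # 3. Propagation: promote the candidate, repoint the other replicas at it,
    #    and record the new primary in the topology server.
    candidate.read_only = False
    for r in replicas:
        if r is not candidate:
            r.source = candidate.name
    topo["primary"] = candidate.name
    return candidate

primary = Replica("tablet-100", position=42, read_only=False)
pool = [Replica("tablet-101", position=42), Replica("tablet-102", position=40)]
topo: dict = {}
print(reparent(primary, pool, topo).name, topo)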
Authors: Daojun Zhang, Yan Wang, Chenyu Zhang, Vadim Bauer
2022-10-26

tldr - powered by Generative AI

Harbor is an open-source cloud-native registry project that stores, manages, signs, and scans content to solve common OCI artifact management challenges. The presentation covers advanced features of Harbor such as OCI artifact management in cloud environments, management of artifacts and their attachments, recommended settings for highly concurrent use, and high-availability deployments. The team also seeks feedback from users and contributors on current features and the future roadmap.
  • Harbor is a trusted cloud-native registry that can store, sign, and scan content
  • Harbor supports any OCI-compatible artifacts
  • Harbor provides advanced features such as OCI artifact management in cloud environments, management of artifacts and their attachments, recommended settings for highly concurrent use, and high-availability deployments
  • Harbor is highly customizable and can be monitored using Prometheus
  • Harbor will deliver system-level robot accounts in addition to project-level robot accounts (a hedged API sketch follows this list)
  • Harbor is an open-source project with a thriving community
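
As a rough sketch of what a system-level robot account looks like from the API side: the host, credentials, endpoint path, and payload fields below are assumptions based on Harbor's v2 REST API and should be checked against your Harbor version.

import requests

HARBOR = "https://harbor.example.com"   # hypothetical Harbor instance

resp = requests.post(
    f"{HARBOR}/api/v2.0/robots",
    auth=("admin", "change-me"),        # hypothetical admin credentials
    json={
        "name": "ci-puller",
        "level": "system",              # system-level, not tied to a single project
        "duration": -1,                 # never expires
        "permissions": [{
            "kind": "project",
            "namespace": "*",           # applies across all projects
            "access": [{"resource": "repository", "action": "pull"}],
        }],
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())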
Authors: Paul Burt, Betty Junod
2021-10-15

tldr - powered by Generative AI

The presentation discusses the challenges of distributed systems and how Kubernetes addresses them through its design choices. It also compares Kubernetes to other modern systems and explores real-world cases of failures.
  • Distributed systems are challenging because failure is inevitable and requires designing systems to handle it gracefully.
  • Kubernetes is designed to handle failure through fault tolerance and traffic routing.
  • Other modern systems, such as Docker Swarm, HashiCorp Nomad, and K3s, have different approaches to handling failure.
  • Distributed-systems concepts such as the CAP theorem, gossip protocols, high availability, and the Raft consensus algorithm are discussed (see the quorum sketch after this list).
  • Real-world cases, such as Target's 2019 cascading failure, are explored to illustrate the challenges of distributed systems.
  • Understanding the problems confronting distributed systems and what 'correct' looks like is essential for designing and operating them effectively.
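
One piece of that picture is easy to make concrete: the quorum arithmetic behind Raft-style consensus. The lines below are a small illustration, not material from the talk.

def quorum(voters: int) -> int:
    # A majority of the voting members is needed to elect a leader or commit.
    return voters // 2 + 1

def tolerated_failures(voters: int) -> int:
    # With a majority required, the cluster survives this many lost voters.
    return (voters - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} voters: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")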
Authors: Daniel Finneran
2021-10-13

tldr - powered by Generative AI

The presentation discusses the journey of developing Kube-vip, a project that provides highly available Kubernetes clusters for various infrastructures, and how it can be used to implement highly available networking and load balancer functionality for Kubernetes services.
  • The presenter started by trying to improve the deployment of Kubernetes clusters on bare metal and take them into production
  • Ensuring highly available access to those clusters proved problematic to implement and to fit into lifecycle patterns
  • Kube-vip evolved from trying to fix that one use case into a widely used project that provides highly available Kubernetes clusters for various infrastructures
  • Kube-vip uses leader election and clustering technology to ensure highly available access to Kubernetes clusters (see the Lease-based sketch after this list)
  • Kube-vip relies on ARP and BGP protocols to update the network and route traffic to the correct node
  • Kube-vip can be used to implement highly available networking and load balancer functionality for Kubernetes services
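
The leader-election part can be sketched with the Kubernetes coordination API: whichever instance holds the Lease owns the virtual IP, which kube-vip then announces to the network via ARP or BGP. The sketch below is illustrative, not kube-vip source; the lease name, namespace, and identity are assumptions, and renewal and expiry handling are omitted.

from kubernetes import client, config

config.load_kube_config()
coord = client.CoordinationV1Api()

LEASE, NS, ME = "kube-vip-demo", "kube-system", "node-a"

def try_acquire() -> bool:
    try:
        lease = coord.read_namespaced_lease(LEASE, NS)
    except client.ApiException as err:
        if err.status != 404:
            raise
        # No lease yet: claim leadership by creating it.
        coord.create_namespaced_lease(NS, client.V1Lease(
            metadata=client.V1ObjectMeta(name=LEASE),
            spec=client.V1LeaseSpec(holder_identity=ME,
                                    lease_duration_seconds=15),
        ))
        return True
    # Leader only if we already hold the lease (renewal and expiry omitted here).
    return lease.spec.holder_identity == ME

print("leader" if try_acquire() else "follower")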