Authors: Rodrigo Serra Inacio, Willian Saavedra Moreira Costa
2023-04-21

tldr - powered by Generative AI

Cloud Metrics is a scalable and resilient platform for monitoring both the systems and the environments of a bank. The key to building the platform was isolating tenants and reducing noise between them. The main components used were Kubernetes, Prometheus, Grafana, and Alertmanager. The infrastructure was built on EKS and hosted in São Paulo, Brazil. Users access their metrics through Grafana and Prometheus images. Each tenant has its own account and bucket to store its metrics.
  • Cloud Metrics is a platform for monitoring both systems and environments of a bank
  • Isolation and reducing noise between tenants was key to building the platform
  • Main components used were Kubernetes, Prometheus, Grafana, and Alertmanager
  • Infrastructure was built on EKS and hosted in São Paulo, Brazil
  • Users access their metrics through Grafana and Prometheus images
  • Each tenant has its own account and bucket to store its metrics (see the sketch after this list)
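
As a rough illustration of the per-tenant isolation described above, here is a minimal Python sketch (not from the talk) in which each tenant's metric blocks land in that tenant's own S3 bucket; the bucket naming convention, region, and paths are assumptions.

import boto3

REGION = "sa-east-1"  # the Sao Paulo region, matching the hosting choice above

def tenant_bucket(tenant: str) -> str:
    # Hypothetical naming convention: one dedicated bucket per tenant.
    return f"cloud-metrics-{tenant}"

def upload_block(tenant: str, local_path: str, key: str) -> None:
    s3 = boto3.client("s3", region_name=REGION)
    # Each tenant writes only to its own bucket; in practice, IAM policies on
    # the tenant's dedicated account would enforce this boundary.
    s3.upload_file(local_path, tenant_bucket(tenant), key)

upload_block("payments", "/data/blocks/chunk-000001", "blocks/chunk-000001")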
Authors: Ricardo Rocha, Spyros Trigazis
2023-04-20

The Kubernetes infrastructure at CERN runs a variety of workloads, from scientific computing to critical services for the campus and our physics accelerator complex. It’s important to offer the features and capabilities our users require, but even more important to deliver the high levels of service they depend on. In this session we present in detail a recent incident in which a rogue maintenance tool deleted a third of our production capacity in minutes, how this resulted in no downtime and only service degradation, and how we were able to recover in a short time. We describe our architecture for high service availability, the options we took to reduce blast radius, the concept of “clusters as cattle”, and how extensive use of GitOps saved the day. We also describe some lessons learned in the process, the cyclic dependencies detected when recovering from a major outage, and the corner cases where more care is needed for stateful workloads and multi-cluster scheduling. We will demo this on stage, showing how real CERN services recover from what not so long ago would have been events with a very serious impact, and how the effort of the last few years has paid off, with our users responding calmly and positively while going through a major incident.
Authors: Karen Jex
2023-04-20

tldr - powered by Generative AI

The presentation discusses the deployment of mission-critical PostgreSQL databases on Kubernetes, exploring the benefits, use cases, and implementation of robust, secure, scalable, and easily manageable database architectures on the platform.
  • Evolution of database architecture from bare metal to virtualization to containerization
  • Introduction to containers, container orchestration, and Kubernetes
  • Benefits of deploying databases on Kubernetes, including flexibility, scalability, and automation of DBA tasks
  • Demonstration of deploying a PostgreSQL cluster on Kubernetes
  • Anecdote about generating data in a database using pgbench (a small data-generation sketch follows this list)
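
Below is a hedged sketch of that kind of data generation, done from Python rather than pgbench itself; the Service hostname, database name, and credentials are assumptions, not values from the talk.

import psycopg2
from psycopg2.extras import execute_values

# Connect through the cluster's (hypothetical) read-write Service.
conn = psycopg2.connect(host="pg-cluster-rw.databases.svc",
                        dbname="app", user="app", password="change-me")
with conn, conn.cursor() as cur:
    # A pgbench-style accounts table, initialised with throwaway rows.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS bench_accounts (
            aid      integer PRIMARY KEY,
            abalance integer NOT NULL DEFAULT 0
        )
    """)
    execute_values(cur,
                   "INSERT INTO bench_accounts (aid, abalance) VALUES %s "
                   "ON CONFLICT (aid) DO NOTHING",
                   [(i, 0) for i in range(1, 100_001)],
                   page_size=1000)
    cur.execute("SELECT count(*) FROM bench_accounts")
    print("rows loaded:", cur.fetchone()[0])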
Authors: Leila Vayghan
2023-04-19

This talk is the story of how Shopify runs a highly available and scalable stateful application on Kubernetes that is accessed securely over the internet. The application discussed is Elasticsearch, which stores petabytes of data across the globe. Search is a fundamental component of an ecommerce platform, and high availability is an important requirement for it. While Kubernetes has proven to be the perfect platform for deploying stateless applications, running stateful applications on it in a highly available and scalable manner can be complicated. This talk will discuss these challenges and share the steps taken towards solving them. For example, Leila will explain the obstacles to implementing storage autoscaling and how using existing Kubernetes features allowed seamless expansion of the persistent disks that store critical search data. She will also explain how her team implemented a feature that allowed shrinking persistent disks without any data loss, saving costs by releasing unused storage. Leila will also explain how Envoy is used to allow clients to connect to Elasticsearch through Kubernetes ingress. This talk will give insight into the challenges and rewards of running highly available and scalable stateful applications on Kubernetes.
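
Storage autoscaling of the kind Leila describes ultimately comes down to growing a PersistentVolumeClaim in place. The sketch below is not Shopify's code; it assumes a hypothetical PVC name, namespace, and a StorageClass with allowVolumeExpansion enabled, and shows the shape of that operation with the Kubernetes Python client.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core = client.CoreV1Api()

# Request a larger disk for one Elasticsearch data pod; Kubernetes and the
# CSI driver then grow the underlying volume, online with most modern drivers.
core.patch_namespaced_persistent_volume_claim(
    name="data-elasticsearch-data-0",   # hypothetical PVC name
    namespace="search",                 # hypothetical namespace
    body={"spec": {"resources": {"requests": {"storage": "2Ti"}}}},
)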
Authors: Aldo Culquicondor, Kante Yin
2023-04-19

tldr - powered by Generative AI

The talk discusses the latest enhancements from Kubernetes SIG Scheduling and opportunities for better support for service and batch workloads.
  • Improvements in scheduler performance for higher scheduling throughput
  • Better support for rolling updates in deployments while maintaining high availability
  • Introduction of the SchedulingGates knob for external integrators to control when a pod becomes schedulable (see the sketch after this list)
  • Development of sponsored projects such as Kueue, scheduling plugins, and the descheduler
  • Discussion on priority and pod scheduling policies
  • Importance of paying attention to machine availability and idle pods
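
To make the SchedulingGates idea concrete, here is a hedged sketch, assuming a cluster and Python client recent enough to expose the field; the pod name, namespace, and gate identifier are illustrative, not from the talk. A pod is created gated, and an external integrator later lifts the gate so the scheduler will consider it.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# 1. Create the pod with a scheduling gate: the scheduler will not bind it
#    while any gate is present.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gated-demo"),
    spec=client.V1PodSpec(
        scheduling_gates=[client.V1PodSchedulingGate(name="example.com/quota-check")],
        containers=[client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)

# 2. Later, the external integrator (e.g. a quota controller) lifts the gate;
#    gates can only be removed after creation, never added.
live = core.read_namespaced_pod(name="gated-demo", namespace="default")
live.spec.scheduling_gates = []
core.replace_namespaced_pod(name="gated-demo", namespace="default", body=live)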
Authors: Deepthi Sigireddi, Manan Gupta
2022-10-27

tldr - powered by Generative AI

The presentation discusses the engineering approach taken by Vitess to solve the consensus problem in a high QPS environment while prioritizing performance over theoretical correctness.
  • Vitess is a single leader system that relies on a topology server for persistent state and recovery
  • The system prioritizes performance over theoretical correctness
  • Durability policy is defined as avoiding data loss and is configurable based on trade-offs between durability and availability
  • Leader election has three stages: revocation, choosing a new leader, and propagation (a toy sketch of these stages follows this list)
  • Planned and unplanned leader elections have different revocation processes
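
The three stages map naturally onto a toy model. The sketch below is illustrative only and is not Vitess source; the Replica class and the integer "replication position" are stand-ins for real tablets and GTID sets.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    position: int            # stand-in for a real replication position (GTID set)
    read_only: bool = True
    source: str = ""

def reparent(old_primary: Replica, replicas: list[Replica], topo: dict) -> Replica:
    # 1. Revocation: make sure the old primary can no longer accept writes.
    old_primary.read_only = True

    # 2. Choosing a new leader: the most caught-up replica (a real system would
    #    also honour the configured durability policy here).
    candidate = max(replicas, key=lambda r: r.position)

    # 3. Propagation: promote the candidate, repoint the other replicas at it,
    #    and record the new primary in the topology server.
    candidate.read_only = False
    for r in replicas:
        if r is not candidate:
            r.source = candidate.name
    topo["primary"] = candidate.name
    return candidate

primary = Replica("tablet-100", position=42, read_only=False)
pool = [Replica("tablet-101", position=42), Replica("tablet-102", position=40)]
topo: dict = {}
print(reparent(primary, pool, topo).name, topo)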
Authors: Daojun Zhang, Yan Wang, Chenyu Zhang, Vadim Bauer
2022-10-26

tldr - powered by Generative AI

Harbor is an open-source cloud-native registry project that stores, manages, signs, and scans content to solve common OCI artifact management challenges. The presentation covers advanced features of Harbor such as OCI artifact management in cloud environments, management of artifacts and their attachments, recommended settings for highly concurrent use, and high-availability deployments. The team also seeks feedback from users and contributors on current features and the future roadmap.
  • Harbor is a trusted cloud-native registry that can store, sign, and scan content
  • Harbor supports any OCI-compatible artifacts
  • Harbor provides advanced features such as OCI artifact management in cloud environments, management of artifacts and their attachments, recommended settings for highly concurrent use, and high-availability deployments
  • Harbor is highly customizable and can be monitored using Prometheus
  • Harbor will deliver system-level robot accounts in addition to project-level robot accounts (a hedged API sketch follows this list)
  • Harbor is an open-source project with a thriving community
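
As a rough sketch of what a system-level robot account looks like from the API side: the host, credentials, endpoint path, and payload fields below are assumptions based on Harbor's v2 REST API and should be checked against your Harbor version.

import requests

HARBOR = "https://harbor.example.com"   # hypothetical Harbor instance

resp = requests.post(
    f"{HARBOR}/api/v2.0/robots",
    auth=("admin", "change-me"),        # hypothetical admin credentials
    json={
        "name": "ci-puller",
        "level": "system",              # system-level, not tied to a single project
        "duration": -1,                 # never expires
        "permissions": [{
            "kind": "project",
            "namespace": "*",           # applies across all projects
            "access": [{"resource": "repository", "action": "pull"}],
        }],
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())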
Authors: Paul Burt, Betty Junod
2021-10-15

tldr - powered by Generative AI

The presentation discusses the challenges of distributed systems and how Kubernetes addresses them through its design choices. It also compares Kubernetes to other modern systems and explores real-world cases of failures.
  • Distributed systems are challenging because failure is inevitable and requires designing systems to handle it gracefully.
  • Kubernetes is designed to handle failure through fault tolerance and traffic routing.
  • Other modern systems, such as Docker Swarm, HashiCorp Nomad, and K3s, have different approaches to handling failure.
  • Distributed-systems concepts such as the CAP theorem, gossip protocols, high availability, and the Raft consensus algorithm are discussed (see the quorum sketch after this list).
  • Real-world cases, such as Target's 2019 cascading failure, are explored to illustrate the challenges of distributed systems.
  • Understanding the problems confronting distributed systems and what 'correct' looks like is essential for designing and operating them effectively.
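
One piece of that picture is easy to make concrete: the quorum arithmetic behind Raft-style consensus. The lines below are a small illustration, not material from the talk.

def quorum(voters: int) -> int:
    # A majority of the voting members is needed to elect a leader or commit.
    return voters // 2 + 1

def tolerated_failures(voters: int) -> int:
    # With a majority required, the cluster survives this many lost voters.
    return (voters - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} voters: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")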
Authors: Daniel Finneran
2021-10-13

tldr - powered by Generative AI

The presentation discusses the journey of developing Kube-vip, a project that provides highly available Kubernetes clusters for various infrastructures, and how it can be used to implement highly available networking and load balancer functionality for Kubernetes services.
  • The presenter started by trying to improve the deployment of Kubernetes clusters on bare metal and take them into production
  • Ensuring highly available access to those clusters proved problematic to implement and to fit into lifecycle patterns
  • Kube-vip evolved from trying to fix that one use case into a widely used project that provides highly available Kubernetes clusters for various infrastructures
  • Kube-vip uses leader election and clustering technology to ensure highly available access to Kubernetes clusters (see the Lease-based sketch after this list)
  • Kube-vip relies on ARP and BGP protocols to update the network and route traffic to the correct node
  • Kube-vip can be used to implement highly available networking and load balancer functionality for Kubernetes services
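
The leader-election part can be sketched with the Kubernetes coordination API: whichever instance holds the Lease owns the virtual IP, which kube-vip then announces to the network via ARP or BGP. The sketch below is illustrative, not kube-vip source; the lease name, namespace, and identity are assumptions, and renewal and expiry handling are omitted.

from kubernetes import client, config

config.load_kube_config()
coord = client.CoordinationV1Api()

LEASE, NS, ME = "kube-vip-demo", "kube-system", "node-a"

def try_acquire() -> bool:
    try:
        lease = coord.read_namespaced_lease(LEASE, NS)
    except client.ApiException as err:
        if err.status != 404:
            raise
        # No lease yet: claim leadership by creating it.
        coord.create_namespaced_lease(NS, client.V1Lease(
            metadata=client.V1ObjectMeta(name=LEASE),
            spec=client.V1LeaseSpec(holder_identity=ME,
                                    lease_duration_seconds=15),
        ))
        return True
    # Leader only if we already hold the lease (renewal and expiry omitted here).
    return lease.spec.holder_identity == ME

print("leader" if try_acquire() else "follower")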