Authors: Raymond de Jong, Anna Kapuścińska
2023-04-21

tldr - powered by Generative AI

The presentation discusses the challenges of observability and security in distributed systems and how Cilium and Hubble can address them.
  • Cilium and Hubble can provide observability and security in distributed systems
  • Existing mechanisms such as traditional monitoring devices and VPC logs fall short in providing context and scalability
  • Cilium uses identity-based observability and security, derived from labels, to secure and monitor traffic
  • Hubble provides a service mesh observability solution for monitoring workloads and exporting flows to other platforms
  • Ready-to-use dashboards are available in the Grafana marketplace for monitoring cluster and application performance
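The label-based identity idea can be shown with a toy model. This is a minimal sketch and not Cilium code. It assumes that an identity is derived from a pod's full label set and that policy is expressed between identities rather than between IP addresses. `IdentityAllocator`, `Policy`, and the starting identity value are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalKey turns a label set into a stable string, so the same labels
// always map to the same identity regardless of map iteration order.
func canonicalKey(labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// IdentityAllocator is a hypothetical in-memory allocator: each distinct
// label set receives its own numeric identity.
type IdentityAllocator struct {
	next uint32
	ids  map[string]uint32
}

func NewIdentityAllocator() *IdentityAllocator {
	return &IdentityAllocator{next: 1000, ids: map[string]uint32{}}
}

// Allocate returns the existing identity for a label set or assigns a new one.
func (a *IdentityAllocator) Allocate(labels map[string]string) uint32 {
	key := canonicalKey(labels)
	if id, ok := a.ids[key]; ok {
		return id
	}
	id := a.next
	a.next++
	a.ids[key] = id
	return id
}

// Policy allows traffic from a source identity to a destination identity.
type Policy map[[2]uint32]bool

func (p Policy) Allowed(src, dst uint32) bool { return p[[2]uint32{src, dst}] }

func main() {
	alloc := NewIdentityAllocator()
	frontend := alloc.Allocate(map[string]string{"app": "frontend", "env": "prod"})
	backend := alloc.Allocate(map[string]string{"app": "backend", "env": "prod"})
	// A second frontend pod with the same labels (but a different IP) shares the identity.
	frontend2 := alloc.Allocate(map[string]string{"env": "prod", "app": "frontend"})

	policy := Policy{{frontend, backend}: true}
	fmt.Println(frontend == frontend2)
	fmt.Println(policy.Allowed(frontend2, backend))
	fmt.Println(policy.Allowed(backend, frontend))
}
```

Because the policy key is the identity rather than the address, new replicas of the frontend are covered without any policy change.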
Authors: Filip Petkovski, Saswata Mukherjee
2023-04-21

tldr - powered by Generative AI

Thanos is an open-source solution for scaling Prometheus-based monitoring: a distributed, highly available metrics system with long-term retention. It addresses scaling challenges such as querying metrics across large time ranges (via downsampling) and ingesting metrics at scale.
  • Prometheus is a standalone monitoring system that scrapes metrics from applications and stores them locally, but on its own it cannot handle a large multi-environment setup or retain data for long periods
  • Thanos fills the gaps in Prometheus by providing a global view, long-term retention, downsampling, and multi-tenancy features
  • Thanos achieves a global view through the Querier, a component that evaluates PromQL, and the Store API, which lets the Querier request time series data from any component
  • Thanos also provides global alerting and recording rules through the Thanos Ruler, which evaluates rules across the entire data set
  • The Thanos sidecar can be configured to upload Prometheus data to object storage, so data can be retained for long periods without depending on local disks
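Downsampling trades resolution for cheaper long-range queries. The Thanos compactor produces 5-minute and 1-hour resolutions that keep per-window aggregates such as count, sum, min, and max. The sketch below shows only that bucketing idea in plain Go. The `Sample` and `Aggregate` types are hypothetical, and the sketch omits the counter-specific handling that real downsampling needs.

```go
package main

import "fmt"

// Sample is one raw data point: a millisecond timestamp and a value.
type Sample struct {
	T int64
	V float64
}

// Aggregate summarizes one window, so avg/min/max over long ranges can
// still be answered from downsampled data.
type Aggregate struct {
	Start         int64
	Count         int
	Sum, Min, Max float64
}

// Downsample groups samples into windows of resMs milliseconds.
// Samples are assumed to be sorted by timestamp with non-negative timestamps.
func Downsample(samples []Sample, resMs int64) []Aggregate {
	var out []Aggregate
	for _, s := range samples {
		start := s.T - s.T%resMs // align to the window boundary
		if len(out) == 0 || out[len(out)-1].Start != start {
			out = append(out, Aggregate{Start: start, Min: s.V, Max: s.V})
		}
		a := &out[len(out)-1]
		a.Count++
		a.Sum += s.V
		if s.V < a.Min {
			a.Min = s.V
		}
		if s.V > a.Max {
			a.Max = s.V
		}
	}
	return out
}

func main() {
	const fiveMinutes = 5 * 60 * 1000
	var raw []Sample
	for i := int64(0); i < 40; i++ { // 40 samples, 15s apart = 10 minutes
		raw = append(raw, Sample{T: i * 15000, V: float64(i)})
	}
	for _, a := range Downsample(raw, fiveMinutes) {
		fmt.Printf("start=%d count=%d avg=%.1f min=%.0f max=%.0f\n",
			a.Start, a.Count, a.Sum/float64(a.Count), a.Min, a.Max)
	}
}
```

Ten minutes of 15-second samples collapse into two 5-minute aggregates, which is why long-range queries touch far less data.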
Authors: Friedrich Gonzalez, Alan Protasio
2023-04-21

tldr - powered by Generative AI

The presentation discusses the reliability and features of Cortex, a Prometheus-compatible project designed to run on Kubernetes.
  • Cortex is designed for Kubernetes and builds on Prometheus rather than replacing it
  • Cortex reuses Thanos components and enforces limits to keep the system reliable
  • Cortex replicates incoming data across multiple instances
  • Upcoming Cortex work includes a gateway, downsampling, federated rules, and native histograms
  • There are plans to improve cardinality observability in the Cortex layer
Authors: Bartłomiej Płotka, Goutham Veeramachaneni
2023-04-21

Download the code ahead of time. DCO required.
Ever wished you knew how to write a Prometheus exporter in Go? How to use Prometheus APIs programmatically? Need to quickly instrument your Go code with Prometheus metrics? Join us to learn how to contribute to, develop, and test Prometheus integrations that are useful day to day. Unblock yourself and others! It's easier than you think!
We will go through useful resources and ways to interact with the project and community, so you can create meaningful applications that use Prometheus effectively.
Note: for an interactive experience, please bring your laptop, your favourite IDE, and a Go installation.
This Contribfest session is designed to provide projects with the space and resources to tackle outstanding technical debt, security issues, or impactful feature requests. Contribfest sessions are intended to provide a place for maintainers to meet contributors and potential contributors and work together on solving a problem.
Authors: Rodrigo Serra Inacio, Willian Saavedra Moreira Costa
2023-04-21

tldr - powered by Generative AI

The presentation describes Cloud Metrics, a scalable and resilient multi-tenant platform for monitoring a bank's systems and environments.
  • Cloud Metrics is a platform for monitoring both the systems and the environments of a bank
  • Isolating tenants and reducing noise between them was key to building the platform
  • The main components are Kubernetes, Prometheus, Grafana, and Alertmanager
  • The infrastructure runs on Amazon EKS, hosted in São Paulo, Brazil
  • Users access their metrics through Grafana and Prometheus
  • Each tenant has its own account and bucket for storing metrics
Authors: Kemal Akkoyun, Matej Gera
2023-04-20

Download the code ahead of time. DCO required.
Thanos, a CNCF project, is a Prometheus-compatible, scalable system for high availability and long-term storage of metrics. Thanos splits various Prometheus functionalities into microservices, which allows it to scale different parts of the system independently based on usage. Although this is how Thanos achieves its aims, the complexity of such a distributed system comes at a cost: operating Thanos requires knowing what you're doing and keeping up with the project's continuous improvements. And there is no better way to gain this insight than by putting your finger on the pulse of the project!
ContribFest will allow participants and companies to explore how easy it is to contribute to and maintain Thanos with the community. During our session, you'll go through the code base with the maintainers, get familiar with the contribution cycle, and learn about our testing framework and CI/CD setup. We plan to involve the audience with interactive tutorials they can follow, asking questions along the way to becoming contributors.
This Contribfest session is designed to provide projects with the space and resources to tackle outstanding technical debt, security issues, or impactful feature requests. Contribfest sessions are intended to provide a place for maintainers to meet contributors and potential contributors and work together on solving a problem.
Authors: Kemal Akkoyun, Bryan Boreham
2023-04-19

As the second-oldest project in the CNCF, Prometheus is probably already familiar to you. It is the de facto standard in cloud-native metrics monitoring and beyond, largely because Kubernetes designed its custom metrics engine for Prometheus. Nevertheless, the project maintainers will start from the very beginning, then deep dive into its internals and the exciting new features that have been released recently or are in the pipeline. You will learn about many opportunities to use Prometheus: we will cover a mix of introductory content, a deeper dive into current developments, and open Q&A at the end. We may even tempt you to contribute to the project yourself.
Authors: Ryota Sawada
2023-04-19

tldr - powered by Generative AI

The presentation discusses multi-cluster observability and the challenges involved in managing metrics and data retention across multiple clusters.
  • Cardinality and data retention are important aspects to consider in multi-cluster observability
  • Metrics can be fetched from running services like Prometheus, but data retention costs can add up quickly
  • Differentiating between clusters and applications is important for effective dashboarding
  • The presentation focuses on Istio, Prometheus, and Thanos as key projects for multi-cluster observability
  • The demo showcases the installation process for Istio and the creation of certificates for secure communication between clusters
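One common way to tell clusters apart in dashboards is to attach a cluster-identifying label to every series, for example through Prometheus `external_labels`, which Thanos also relies on. The sketch below uses hypothetical `Series`, `withExternalLabels`, and `sumBy` helpers, and the label values are invented. It shows why the label matters: without `cluster`, the two `checkout` series would be indistinguishable once merged.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is a simplified time series: a label set and its latest value.
type Series struct {
	Labels map[string]string
	Value  float64
}

// withExternalLabels copies each series and adds cluster-level labels.
// In this sketch, a label already present on the series wins on conflict.
func withExternalLabels(in []Series, ext map[string]string) []Series {
	out := make([]Series, 0, len(in))
	for _, s := range in {
		l := make(map[string]string, len(s.Labels)+len(ext))
		for k, v := range ext {
			l[k] = v
		}
		for k, v := range s.Labels {
			l[k] = v
		}
		out = append(out, Series{Labels: l, Value: s.Value})
	}
	return out
}

// sumBy aggregates values per distinct value of one label, like `sum by (cluster)`.
func sumBy(series []Series, label string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range series {
		out[s.Labels[label]] += s.Value
	}
	return out
}

func main() {
	east := []Series{{Labels: map[string]string{"app": "checkout"}, Value: 120}}
	west := []Series{{Labels: map[string]string{"app": "checkout"}, Value: 80}}
	all := append(withExternalLabels(east, map[string]string{"cluster": "east"}),
		withExternalLabels(west, map[string]string{"cluster": "west"})...)

	totals := sumBy(all, "cluster")
	keys := make([]string, 0, len(totals))
	for k := range totals {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	for _, k := range keys {
		fmt.Printf("cluster=%s checkout=%.0f\n", k, totals[k])
	}
}
```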
Authors: Benjamin Raskin, Emma Wang
2022-10-28

tldr - powered by Generative AI

The presentation discusses DoorDash's migration of infrastructure and application metrics from StatsD to Prometheus, and the challenges and lessons encountered along the way.
  • The migration involved over 130 services, 1500 dashboards, and more than 7000 alerts.
  • Switching from percentiles to histograms was a difficult change for engineers to adapt to.
  • The instance label is a high-cardinality label that needs to be pre-aggregated away to reduce volume.
  • A Prometheus aggregation gateway was used for some metrics, but push models were limited to special cases.
  • Automating the monitoring onboarding process for teams is crucial.
  • The migration was completed in one year, resulting in over 27,000 alerts and 2200 dashboards.
  • Post-migration, DoorDash ingests over 15 million metrics per second and persists over 10 million metrics per second.
Authors: Matthias Loibl, Nadine Vehling
2022-10-28

In this talk, Nadine and Matthias will give an introduction to Pyrra, a project that aims to make Service Level Objectives (SLOs) with Prometheus manageable, accessible, and easy to use for everyone. Nadine will talk about the project approach and the findings from creating an easy-to-use observability tool. Matthias will then walk the audience through setting up a Pyrra instance on Kubernetes and connecting it to either Prometheus or Thanos. After a successful deployment, every component of the cluster will get an SLO, starting with etcd, the Kubernetes API server and kubelet, and CoreDNS, and finally Prometheus and Pyrra itself. To close, a demo will showcase an outage in the cluster and what the resulting alerting looks like, discussing how it has improved the lives of on-call engineers.
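The arithmetic behind a request-based availability SLO is small enough to show directly. This is a sketch under that assumption. `Objective` and its methods are hypothetical and are not Pyrra's API. The error budget is the share of requests allowed to fail. The burn rate compares the observed error ratio with that allowance: a burn rate of 1 exhausts the budget exactly at the end of the SLO window.

```go
package main

import "fmt"

// Objective is a hypothetical availability SLO, e.g. 99.9% over 30 days.
type Objective struct {
	Target float64 // fraction of requests that must succeed, e.g. 0.999
}

// ErrorBudget is the number of failed requests tolerated for a given total.
func (o Objective) ErrorBudget(total float64) float64 {
	return (1 - o.Target) * total
}

// BudgetRemaining is the unspent fraction of the budget
// (negative once the objective is violated).
func (o Objective) BudgetRemaining(total, failed float64) float64 {
	budget := o.ErrorBudget(total)
	return (budget - failed) / budget
}

// BurnRate is the observed error ratio divided by the allowed error ratio.
func (o Objective) BurnRate(total, failed float64) float64 {
	return (failed / total) / (1 - o.Target)
}

func main() {
	slo := Objective{Target: 0.999} // 99.9% of requests must succeed
	total, failed := 1_000_000.0, 400.0
	fmt.Printf("error budget: %.0f failed requests\n", slo.ErrorBudget(total))
	fmt.Printf("budget remaining: %.0f%%\n", 100*slo.BudgetRemaining(total, failed))
	fmt.Printf("burn rate: %.1f\n", slo.BurnRate(total, failed))
}
```

Alerting on burn rate rather than raw error counts makes an outage page quickly while a slow trickle of errors does not.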