Building a Day1/Day2 Application Operations Platform On CNCF Projects.

2022-10-26

Authors:   Alois Reitbauer, Alex Jones


Summary

The presentation discusses the importance of automation and observability in day two operations for managing digital infrastructure using Kubernetes.
  • Day two operations rely on automation and observability to remove humans from the equation.
  • Good Ops Kubernetes is an operator that enables the pattern of operation for managing the lifecycle of digital infrastructure.
  • SLOs and error budgets are becoming the driving force behind corrective actions for operators.
  • Extending the desired state of the system is necessary for day two operations to actively modify the system's configuration.
  • Enhancing context-less alerts with tracing is necessary for effective remediation workflows.
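
To make the SLO and error-budget point above concrete, here is a minimal sketch of the arithmetic, assuming a simple request-based SLO. The function names and the 20% remediation threshold are hypothetical choices for illustration and do not come from the talk or from any particular CNCF project.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the required success ratio, for example 0.99.
    """
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def should_remediate(slo_target: float, total: int, failed: int,
                     threshold: float = 0.2) -> bool:
    """Hypothetical policy: act once less than `threshold` of the budget is left."""
    return error_budget_remaining(slo_target, total, failed) < threshold
```

With a 99% target over 10,000 requests, the budget is about 100 failed requests, so 50 failures leave roughly half of it and 90 failures would trip this hypothetical threshold.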
The presentation highlights the problem of context-less alerts in a massively distributed system. Without context, alerts for failing services can be overwhelming and difficult to remediate. However, by enhancing these alerts with tracing, it becomes easier to identify which specific service needs to be remediated, leading to more effective workflows.
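
One way to picture how tracing narrows a flood of alerts down to a single remediation target is sketched below. It assumes the call graph has already been extracted from trace data into a plain dictionary; the service names, the `calls` structure, and the selection rule (pick alerting services with no alerting downstream dependency) are illustrative assumptions, not the speakers' implementation.

```python
from typing import Dict, List, Set


def downstream_of(start: str, calls: Dict[str, List[str]]) -> Set[str]:
    """Collect every service reachable from `start` in the call graph."""
    seen: Set[str] = set()
    stack = list(calls.get(start, []))
    while stack:
        node = stack.pop()
        if node == start or node in seen:
            continue  # skip cycles and already-visited services
        seen.add(node)
        stack.extend(calls.get(node, []))
    return seen


def remediation_candidates(alerting: Set[str],
                           calls: Dict[str, List[str]]) -> Set[str]:
    """Return alerting services whose downstream dependencies are all healthy.

    These are the likeliest origins; alerts on their callers are
    probably symptoms of the same failure.
    """
    return {
        service
        for service in alerting
        if not (downstream_of(service, calls) & alerting)
    }


# Hypothetical call graph derived from traces: caller -> callees.
calls = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
}
alerting = {"frontend", "checkout", "payments"}
candidates = remediation_candidates(alerting, calls)
```

In this example all three alerting services sit on one call path, and only `payments` has no alerting dependency beneath it, so it is the single candidate for remediation.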

Abstract

Effectively delivering and operating large, complex cloud-native applications is becoming more and more important as companies move an increasing number of applications to Kubernetes. Most companies are building self-service platforms that individual teams can use while still enabling the company to drive company-wide practices. The cloud-native ecosystem provides a large number of projects that help with different aspects of building these platforms. In this talk we will cover all major aspects of the application lifecycle, from build and test through provisioning, delivery, and release, all the way to operational management, and showcase different tools and how they can be used and combined.

After the talk you will be able to answer all the questions below and more:
  • How can I best build cloud-native applications?
  • What are the best approaches to providing standard components like databases?
  • How can I provision infrastructure following the same cloud-native approach I use for my application?
  • How can I best manage the deployment and rollout process?
  • How can I seamlessly integrate practices like chaos testing?
  • How can I automate the setup of operations requirements like security, observability, …?
  • How can I automate day 2 operations at an infrastructure and application level?

We will focus on sharing concepts combined with small examples that illustrate how different aspects can be handled with different tools.

Materials: