Authors: Adnan Hodzic
2023-04-19

This talk covers the 2+ year journey of migrating ING’s MLP (Machine Learning Platform) to Kubernetes. As the biggest bank in the Netherlands and one of the biggest banks in the world, ING operates in a highly regulated environment and is subject to rigorous policies controlling the IT process lifecycle. For a data scientist in such an environment, deploying pre-trained machine learning models to production with little or no underlying SRE/deployment knowledge is complicated. That’s where MLP steps in: it takes care of these problems by serving as a model hosting platform. As an SRE, Adnan will cover the problems and limitations of the existing platform setup in the VM (Virtual Machine) world, the inception of the idea to migrate to Kubernetes, and the steps taken to start realizing that idea and its migration plan. He then covers the resistance that followed, the inability to choose an ideal target destination, and the platform’s growth, which made the existing setup increasingly hard to support and ultimately led to scalability issues. All these factors created a perfect storm that led to the inevitable: migration to Kubernetes, and how that process came to be.
Authors: Wojciech Tyczyński
2023-04-19

tldr - powered by Generative AI

Tips for dividing workloads among multiple clusters in Kubernetes
  • Networking puts the most stress on the control plane and is where the largest number of issues are seen
  • Understanding the size of churn forward or observed services is a significant factor in workload division
  • The current scalability limit of 5000 nodes is not a hard limit and there are no plans to push it further in open source
  • External factors like third-party controllers and ecosystem improvements need to be addressed
  • Using the watch protocol for getting large collections of data can help with memory consumption and system throughput
  • Graceful shutdowns can prevent the control plane from being blown out by hundreds of thousands of watches
  • Optimizations should be balanced with complexity versus return on investment trade-off
Authors: Harry Lee
2022-10-27

tldr - powered by Generative AI

The conference presentation discusses the design and implementation of a central data aggregation platform for a smart energy management system in South Africa, targeting big energy consumers such as office blocks, industrial factories, and the mining industry. The platform uses IoT devices to measure energy usage, estimate costs, and optimize electricity usage with automation. The presentation highlights the challenges of building a solution for companies and industrial plants located in rural areas with infrastructure limitations, intermittent internet connectivity, and power outages due to load shedding. The solution needs to be resilient, work offline, and use open-source technologies. Kubernetes is chosen for its resilience, high availability, and ability to run pods from previous states.
  • South Africa is facing an energy crisis due to a limited supply of electricity, which drives up costs and impacts businesses heavily reliant on electricity
  • The smart energy management system targets big energy consumers and uses IoT devices to measure energy usage, estimate costs, and optimize electricity usage with automation
  • The central data aggregation platform is designed to work offline, be resilient, and use open-source technologies
  • Kubernetes is chosen for its resilience, high availability, and ability to run pods from previous states
  • The solution needs to be flexible, work with existing network infrastructure, and reduce setup costs
  • Multiple teams are involved in building the IoT devices, gateway, IoT platform, and advanced data analytics in the cloud
Authors: Marcel Zięba
2022-10-27

tldr - powered by Generative AI

The presentation discusses the importance of scalability and reliability in Kubernetes and how to improve it.
  • Using immutable secrets can make Kubernetes API more reliable
  • Priority and fairness can increase the reliability of Kubernetes
  • Efficiently designed controllers with CRDs are not a problem
  • Node-oriented controllers can cause scalability issues
  • Redesigning individual components should be a last resort
  • Deprecating features should be avoided to prevent breaking users
  • Introducing more efficient ways of doing things can steer people towards more scalable approaches
  • Load testing can be helpful for component maintainers
Authors: Alexander Wels, Michael Henriksen, Ryan Hallisey, Kat Morgan
2022-10-26

Download the code ahead of time. DCO Required. The KubeVirt Maintainers will organize into small groups to help improve scalability of KubeVirt components. This Contribfest session is designed to provide projects with the space and resources to tackle outstanding technical debt, security issues, or outstanding impactful feature requests. They are intended to provide a place for maintainers to meet contributors and potential contributors and work together on solving a problem.
Authors: Wojciech Tyczyński, Marcel Zięba
2022-05-20

tldr - powered by Generative AI

The presentation discusses the implementation of features such as efficient watch resumption and immutable secrets in Kubernetes to increase reliability and scalability. The speaker also talks about the tools and infrastructure used for scalability testing in Kubernetes.
  • Using immutable secrets can make Kubernetes API more reliable and reduce pressure on API servers
  • Priority and fairness are heavily worked on to increase Kubernetes reliability
  • ClusterLoader2 is a tool used for scalability testing in Kubernetes
  • Kubemark is a simulation of the cluster used for scalability testing instead of running 5000 nodes
  • Hollow nodes are used in Kubemark to simulate regular nodes without actually running pods
  • Hollow kube-proxy is a component that puts pressure on the API server
Authors: Alper Rifat Ulucinar
2022-05-18

tldr - powered by Generative AI

The talk discusses the performance issues related to the API server when installing thousands of CRDs and how to troubleshoot them using profiling tools. It also provides insights into the mechanics of CRDs and tips for getting changes into upstream.
  • Custom resources are used to extend the K8s API server with a declarative API
  • Initial attempts to install thousands of CRDs revealed severe performance issues related to the API server
  • Profiling tools can be used to troubleshoot API server performance issues
  • Real world data can help pinpoint the root causes of scaling issues
  • Insights into the mechanics of CRDs are provided
  • Tips for getting changes into upstream and moving the ecosystem forward are shared
Authors: Bryan Boreham, Alvin Lin
2021-10-15

Cortex is a time-series data store based on Prometheus. Cortex adds:
  • Scalability: run across dozens of servers to handle millions of samples per second.
  • Availability: if one server fails then work will be redirected to others.
  • Multi-tenancy: store data from different groups or customers, segregated so a user from one tenant cannot see data from another.
  • Durability: use cloud stores (such as S3) to reduce the chance of data loss.
This session will provide an overview of Cortex, an update on recent news from the project, and a run-through of top 5 tips for running Cortex in production.
Authors: Wojciech Tyczyński, Marcel Zięba
2021-10-15

tldr - powered by Generative AI

The presentation discusses the efforts of SIG Scalability in defining and improving scalability in Kubernetes, as well as monitoring and guarding against performance regressions.
  • SIG Scalability is focused on defining what scalability means for Kubernetes and executing towards those goals
  • They work with individual SIGs to ensure improvements are made and contribute to cross-SIG improvements
  • Monitoring and measuring current scalability levels is critical to understanding progress towards goals
  • Guarding against performance regressions is important to maintain scalability
  • Scalability is a job for everyone in the community, not just a small group
Authors: Andreas Grabner
2021-10-13

Moving to k8s doesn’t prevent anyone from making bad architectural decisions that lead to performance degradations, scalability issues or SLO violations in production. In fact, smaller services running in pods connected through service meshes are even more vulnerable to bad architectural or implementation choices. To avoid bad deployments, the CNCF project Keptn provides automated SLO-based performance analysis as part of your CD process. Keptn automatically detects architectural and deployment changes that have a negative impact on performance and scalability. It uses SLOs (Service Level Objectives) to ensure your services always meet your objectives. The Keptn team has also published SLO best practices for identifying well-known performance patterns, drawn from years of analyzing hundreds of distributed software architectures deployed on k8s. Join this session to learn what these patterns are and how Keptn helps you prevent them from entering production.