Keynote: Push It to the Limit: From Canary Deployments to Canary Clusters

2022-05-20

Authors: Henrik Høegh


Summary

The presentation describes how Lunar bank moved from canary deployments to canary clusters and gained the ability to fail over its platform using GitOps. It focuses on the tech stack used and the challenges faced during the process.
  • Lunar bank moved from canary deployments to canary clusters because its customers rely on it to move quickly and deliver new features in a highly reliable manner.
  • The production clusters were made truly disposable by deeply integrating with the infrastructure provider, writing new custom operators, and moving most state out of the cluster.
  • The company achieved the ability to fail over using GitOps, which was complex and required a lot of work.
  • The tech stack included Kubernetes, GitOps with Flux, AWS, S3, RabbitMQ, and external DNS.
  • The challenges included merge complexity in the GitOps repo, new deployments stalling during the exercise, and discomfort among employees due to the complexity of the process.
The company created a branch for the new cluster and made some edits to it, such as the cluster name and routing weights. They then spun up another Kubernetes cluster, pointed Flux to it, and federated the two clusters. They made some edits in both branches and shifted the routing weight until they stopped using the old cluster. They removed the services and federation and merged the branch into the main branch, pointing Flux to it. However, this process was complex and caused discomfort among employees.
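The gradual shift of routing weight from the old cluster to the new one can be sketched as a simple schedule. This is only an illustration under stated assumptions: the summary does not say how weights were expressed or applied, so the percentage scale, the step size, and the names `RoutingWeights` and `shift_schedule` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingWeights:
    """Share of traffic (in percent) sent to each cluster. Hypothetical type."""
    old_cluster: int
    new_cluster: int


def shift_schedule(step: int = 25) -> list[RoutingWeights]:
    """Return the sequence of weight pairs for a gradual cutover.

    Starts with all traffic on the old cluster and ends with all traffic
    on the new one, moving `step` percentage points at a time.
    The step size is an assumed parameter, not a value from the talk.
    """
    if not 0 < step <= 100:
        raise ValueError("step must be in the range 1..100")

    schedule = []
    new = 0
    while new < 100:
        schedule.append(RoutingWeights(old_cluster=100 - new, new_cluster=new))
        new = min(100, new + step)
    # Final state: the old cluster receives no traffic and can be removed.
    schedule.append(RoutingWeights(old_cluster=0, new_cluster=100))
    return schedule


if __name__ == "__main__":
    for weights in shift_schedule(25):
        print(f"old={weights.old_cluster}% new={weights.new_cluster}%")
```

In the branch-based process described above, each pair in the schedule would correspond to one edit of the routing weights in the two branches; only once the old cluster's weight reaches zero are its services and the federation removed.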

Abstract

At Lunar bank we had a good problem: our customers rely on us to move quickly, provide new features, and do so in a highly reliable manner. To meet their needs we set out on a journey to move from canary deployments, where we could test new features in a safe fashion, to canary clusters. We envisioned a world where our production clusters were truly disposable, and after 3 years we finally achieved that goal. In this session we will share how we did it, and how you can too. Today any engineer at Lunar bank can fail over the entire platform in 40 minutes. By deeply integrating with our infrastructure provider, writing some new custom operators, and moving most state out of the cluster, Lunar is in a position to make disaster recovery a day-to-day operation. Listen as Henrik shares the successes, key learnings, and challenges we faced along the way.
