Disaster Recovery of Stateful Applications in a Multi-Cluster Environment

Conference: KubeCon + CloudNativeCon North America 2021

2021-10-13

Authors: Shyam Ranganathan, Orit Wasserman

Summary

The presentation discusses disaster recovery of stateful applications in a multi-cluster environment using replication-capable storage systems such as Ceph/Rook.

Disaster recovery is important to ensure business continuity in case of data center loss.
Regional disaster recovery involves two separate remote sites with high network latency and two separate Kubernetes clusters.
Replication-capable storage systems such as Ceph/Rook can be leveraged to provide disaster recovery of workloads across clusters.
A multi-cluster control plane is required to enable a one-click disaster recovery solution for stateful workloads.
Volume replication and volume replication class are added to the standard CSI API to extend its capabilities.
Dynamic provisioning requires creating a matching PV in the recovery site and connecting it to the replicated volume.
Multi-cluster management requires equivalent cluster configurations and deployment of custom resources and operators on all clusters.

In the event of a data center loss, disaster recovery is crucial to business continuity. Replication-capable storage systems such as Ceph/Rook can provide disaster recovery of workloads across clusters. For dynamically provisioned volumes, this involves creating a matching PV at the recovery site and connecting it to the replicated volume. Multi-cluster management is also needed so that custom resources and operators are deployed on all clusters. With these measures in place, a business can continue to serve its customers after a disaster.
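
The "matching PV" step can be illustrated with a minimal sketch. It is not the speakers' implementation. It assumes the replicated volume is reachable on the recovery site under the same CSI volume handle, and that the PV manifest from the primary cluster has been saved somewhere the recovery cluster can read it. The function name `build_recovery_pv` and all example values (PV name, driver name, volume handle, UIDs) are hypothetical.

```python
import copy

# Metadata written by the primary cluster's API server. It identifies the
# object in that cluster only, so it is dropped before re-creating the PV.
CLUSTER_SPECIFIC_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "managedFields",
    "selfLink",
)


def build_recovery_pv(primary_pv: dict) -> dict:
    """Return a PV manifest for the recovery cluster that mirrors primary_pv.

    Assumption: the CSI section (driver and volumeHandle) is valid as-is on
    the recovery site because the storage system replicates the volume.
    """
    pv = copy.deepcopy(primary_pv)  # never mutate the saved primary manifest

    metadata = pv.setdefault("metadata", {})
    for field in CLUSTER_SPECIFIC_METADATA:
        metadata.pop(field, None)

    spec = pv.setdefault("spec", {})
    # Keep the backing volume even if the PVC is deleted during failover.
    spec["persistentVolumeReclaimPolicy"] = "Retain"

    # Pre-bind to the PVC by namespace and name only. The PVC on the
    # recovery cluster is a new object, so the old UID must not be reused.
    claim_ref = spec.get("claimRef")
    if claim_ref is not None:
        spec["claimRef"] = {
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "namespace": claim_ref["namespace"],
            "name": claim_ref["name"],
        }

    # Status is filled in by the recovery cluster after creation.
    pv.pop("status", None)
    return pv


# Hypothetical PV as saved from the primary cluster.
primary_pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "pvc-example", "uid": "1111", "resourceVersion": "42"},
    "spec": {
        "capacity": {"storage": "10Gi"},
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "example-rbd",
        "persistentVolumeReclaimPolicy": "Delete",
        "csi": {
            "driver": "example.csi.driver",
            "volumeHandle": "example-volume-handle",
        },
        "claimRef": {
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "namespace": "shop",
            "name": "db-data",
            "uid": "2222",
            "resourceVersion": "7",
        },
    },
    "status": {"phase": "Bound"},
}

recovery_pv = build_recovery_pv(primary_pv)
```

Applying `recovery_pv` on the recovery cluster before the workload's PVC is created would let that PVC bind to the pre-provisioned PV instead of triggering dynamic provisioning of a new, empty volume.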

Abstract

Have you ever wondered how to provide for disaster recovery of the state stored in your persistent volumes? What needs to happen to recover the workload on an alternate Kubernetes cluster? How can the state be replicated and workloads recreated to use their replicated volumes? Our talk aims to elaborate on the various issues around recovering a workload and its state in a multi-cluster, multi-region environment. We will demonstrate how replication-capable storage systems, such as Ceph/Rook, instead of higher-level tools, can be leveraged to provide disaster recovery of workloads across clusters. In addition, this session will tease out the features required in a multi-cluster control plane to enable a one-click disaster recovery solution for stateful workloads. Attendees will learn how to approach building disaster recovery solutions for their own clouds.
