The presentation introduces shuffle sharding, a technique for limiting the impact of outages and improving isolation between tenants in a horizontally scalable Cortex cluster.
- Cortex aims to be horizontally scalable by hashing the labels within each sample and spreading the data among the nodes in a cluster
- A replication factor of three, with quorum reads and writes, is used so that a single node outage can be tolerated
- Shuffle sharding builds small virtual clusters inside a larger real cluster to improve isolation between tenants
- Shuffle sharding is more flexible and manageable than the cellular approach of mapping tenants to clusters
- Amazon sponsored Grafana Labs to make changes to Cortex and worked closely with them on the design and review
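The first two points (label hashing, replication factor of three, quorum reads and writes) can be illustrated with a minimal Python sketch. This is not Cortex's code: the function names are hypothetical, and placement by simple modulo over a node list is an assumption made for brevity in place of a real hash ring.

```python
import hashlib

REPLICATION_FACTOR = 3                  # value mentioned in the talk
QUORUM = REPLICATION_FACTOR // 2 + 1    # majority of replicas: 2 of 3


def series_hash(tenant, labels):
    """Hash a tenant ID plus a sample's labels to a stable integer."""
    # Sort the labels so the same label set always yields the same key.
    key = tenant + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def pick_replicas(nodes, tenant, labels, rf=REPLICATION_FACTOR):
    """Pick rf consecutive nodes, starting at the position given by the hash.

    Assumption: modulo placement over a list stands in for a token ring.
    """
    start = series_hash(tenant, labels) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def quorum_reached(acks, quorum=QUORUM):
    """A read or write succeeds once a majority of replicas has answered."""
    return acks >= quorum
```

With three replicas and a quorum of two, one replica can be down and both reads and writes still succeed, which is why a single node outage is tolerated.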
The speaker notes that the chance of a total cluster outage rises as Cortex clusters get bigger, and that a poison request or a bad query could take out an entire cluster for all tenants. Shuffle sharding is presented as a solution to this problem: it builds small virtual clusters inside a much larger real cluster, improving isolation between tenants at only a modest cost in utilization.
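The idea of a small virtual cluster per tenant can be sketched as follows. This is a hedged illustration rather than the implementation described in the talk: `tenant_shard` and `full_overlap_probability` are hypothetical names, and seeding a random generator from a hash of the tenant ID is an assumption about how a stable per-tenant selection could be made.

```python
import hashlib
import math
import random


def tenant_shard(nodes, tenant_id, shard_size):
    """Deterministically pick a small subset of nodes for one tenant.

    Assumption: the tenant ID seeds the random choice, so every caller
    computes the same shard without any coordination.
    """
    digest = hashlib.sha256(tenant_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort the input so the result does not depend on the caller's ordering.
    return sorted(rng.sample(sorted(nodes), shard_size))


def full_overlap_probability(num_nodes, shard_size):
    """Chance that another tenant's uniformly random shard is identical."""
    return 1 / math.comb(num_nodes, shard_size)
```

A poison request from one tenant can only reach the nodes in that tenant's shard. With 50 nodes and a shard size of 4 there are 230,300 possible shards, so under the uniform-choice assumption two tenants rarely share every node, and a tenant that shares only some nodes with a misbehaving neighbour still has healthy replicas left.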