Fast Image Pulls Using IPFS And Opportunistic Caching

2022-10-28

Authors:   Christian Weichel, Manuel de Brito Fontes


Summary

The presentation describes how Gitpod reduced image pull times by combining a registry facade with IPFS-backed caching.
  • Gitpod first tried pre-pulling images, pre-baking them into VM images, and relying on Kubernetes mechanisms, but none of these reduced image pull times effectively.
  • The solution was a registry facade that dynamically assembles the image manifest and points to an IPFS instance that serves as the cache.
  • The nerdctl IPFS registry was explored but did not solve the problem.
  • IPFS is a peer-to-peer distributed file system, which makes it usable as a caching layer.
With this setup, the p50 startup time dropped by more than half, from 24 seconds to 10 seconds.
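The idea of a facade that assembles a manifest on the fly and points at IPFS can be illustrated with a minimal sketch. It is not the implementation shown in the talk: the index `IPFS_INDEX`, the gateway address `IPFS_GATEWAY`, and the functions `layer_descriptor` and `assemble_manifest` are hypothetical names, and using the optional `urls` field of an OCI descriptor to reference the cached copy is an assumption made for illustration.

```python
import hashlib

# Hypothetical index: layer digest -> IPFS content identifier (CID).
# A real deployment would keep this mapping in a shared store.
IPFS_INDEX = {}

# Hypothetical address of an in-cluster IPFS HTTP gateway.
IPFS_GATEWAY = "http://ipfs.cluster.local:8080"


def layer_descriptor(blob, media_type="application/vnd.oci.image.layer.v1.tar+gzip"):
    """Build an OCI-style descriptor for one blob.

    If the blob is known to the (hypothetical) IPFS index, add a `urls`
    entry that points at the gateway instead of the upstream registry.
    """
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    descriptor = {"mediaType": media_type, "digest": digest, "size": len(blob)}
    cid = IPFS_INDEX.get(digest)
    if cid is not None:
        descriptor["urls"] = [f"{IPFS_GATEWAY}/ipfs/{cid}"]
    return descriptor


def assemble_manifest(config_blob, layer_blobs):
    """Assemble an image manifest dynamically from a config and its layers."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": layer_descriptor(
            config_blob, media_type="application/vnd.oci.image.config.v1+json"
        ),
        "layers": [layer_descriptor(blob) for blob in layer_blobs],
    }
```

Layers that are not yet in the index simply carry no `urls` entry, so a client would fetch them through the normal registry path.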

Abstract

Image pull times pose a considerable challenge when optimising for fast container starts within Kubernetes, due to potentially large images or network topology, bandwidth and egress cost constraints. Container runtimes offer layer-based node-local caches which help improve pull times when there’s high layer reuse, but find their limits when clusters need to scale quickly or there’s little control over the images which are used. We present the results of our efforts to bring pull times down, which yielded a considerable improvement. Our goal was to optimise performance and networking cost without imposing limits on the container images themselves. We went through several iterations which combined eStargz and nerdctl-registry with an in-cluster IPFS deployment. Using an opportunistic pull-through caching mechanism, we were able to bring image pull times down considerably without imposing an extra burden on users (i.e. folks deploying the pods). We have been operating this setup in production on gitpod.io for over six months. In this session we will provide insight into our learnings, backed by the real-world data and observations we have gathered.
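The opportunistic pull-through caching mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the class `PullThroughCache` and the callable `fetch_upstream` are hypothetical, and a plain dictionary stands in for the in-cluster IPFS deployment.

```python
import hashlib


class PullThroughCache:
    """Serve blobs from a cache, filling it as a side effect of normal pulls."""

    def __init__(self, fetch_upstream):
        # fetch_upstream: hypothetical callable, digest -> blob bytes,
        # standing in for a request to the upstream registry.
        self._fetch_upstream = fetch_upstream
        # A dict stands in for the IPFS-backed store in this sketch.
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_blob(self, digest):
        cached = self._store.get(digest)
        if cached is not None:
            self.hits += 1
            return cached

        # Cache miss: pull through to upstream, verify, then cache.
        self.misses += 1
        blob = self._fetch_upstream(digest)
        actual = "sha256:" + hashlib.sha256(blob).hexdigest()
        if actual != digest:
            raise ValueError(f"digest mismatch: expected {digest}, got {actual}")
        self._store[digest] = blob
        return blob
```

Because blobs are addressed by digest, the cache never needs to know which images exist in advance; it fills itself with whatever users actually pull, which is what keeps the approach free of restrictions on the images themselves.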

Materials: