
How We Made Our CI Pipeline 5x Faster
Reducing build time from 25 to 5 minutes with Dagger and Depot

Sági-Kazár Márk (@sagikazarmark)

We’ve never made a secret out of being huge fans of Dagger. Dagger has been a game-changer, enabling us to build portable CI pipelines that developers can run directly on their machines. This eliminated the dreaded push’n’pray—that endless cycle of pushing to Git and then praying the CI would turn green.

However, as our software grew in complexity, our CI pipeline running on GitHub Actions began to slow down. What started as a fast, efficient process became bogged down by an ever-increasing number of steps. Our average build time stretched to a painful 25 minutes at its peak. This wasn't just frustrating; it was a significant problem when trying to release a quick fix for a bug causing a production incident. A 25-minute delay in that scenario feels like an eternity.

In our initial attempts to solve this issue, we threw more resources at the problem. We purchased larger runners from GitHub, hoping it would speed things up, but it quickly became evident that these runners weren’t just expensive—they also didn’t deliver the speed boost we needed. Our builds weren’t just slow at that point; we also wasted money for no real gain.

Clearly, we needed to look elsewhere for a solution. It was a slow and painful process of trial and error, but I’m happy to report that we’ve successfully brought our average build time down from 25 to just 5 minutes. The improvement in productivity has been monumental, especially when we’re under pressure to push out a fix. We’re also looking at a 50% reduction in costs.

So, how did we do it? Let me walk you through the process.

Caching is key

One of the most crucial factors in achieving our CI improvements was leveraging Dagger’s powerful caching mechanisms. In essence, Dagger offers two types of caching: layer cache and cache volumes. Both are instrumental in speeding up what would otherwise be time-consuming processes, especially in complex CI pipelines.

Consider the layer cache similar to how layers work when building a Dockerfile: each command in a Dockerfile creates a new layer, and if the content (or the instruction) of a layer doesn't change between builds, Docker can reuse that layer instead of rebuilding it from scratch. Dagger’s layer cache operates on the same principle. By caching the results of intermediate steps in your pipeline, Dagger allows you to skip the repetitive work when the steps haven't changed, significantly speeding up subsequent runs.

Cache volumes are analogous to cache mounts in a Dockerfile. Cache volumes store data that can be reused across pipeline runs, such as downloaded dependencies or built artifacts. These volumes persist between pipeline runs, so tasks that need this data don’t have to start from scratch every time.
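
To make that concrete, here is a minimal sketch using Dagger’s core container API (the function name, base image, mount paths, and volume names are illustrative for this example, not taken from our pipeline): a named cache volume is mounted at a path inside the container, much like a cache mount in a Dockerfile, and whatever is written there survives across runs.

func (m *Example) Build(source *dagger.Directory) *dagger.Container {
  return dag.Container().
    From("golang:1.22").
    // Named cache volumes: their contents persist between pipeline runs,
    // so module downloads and build artifacts are not recreated every time.
    WithMountedCache("/go/pkg/mod", dag.CacheVolume("go-mod")).
    WithMountedCache("/root/.cache/go-build", dag.CacheVolume("go-build")).
    WithDirectory("/src", source).
    WithWorkdir("/src").
    WithExec([]string{"go", "build", "./..."})
}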

Caching is what makes otherwise long processes much faster. For example, the first time you run a Dagger function, it might take some time, but running the same function again takes much less time, thanks to caching. This efficiency boost is especially noticeable when iterating on your code or testing changes.

While this caching process is relatively straightforward on a developer's local machine, where data persists across runs, it becomes more challenging in the cloud with ephemeral CI runners. These runners are typically spun up fresh for each CI job, meaning they don’t retain any state between runs. Making the cache available across these isolated runs is crucial to maintaining speed.

This is where Dagger Cloud steps in. Dagger Cloud provides a distributed cache service that seamlessly integrates with your CI pipelines, ensuring that the benefits of caching aren’t lost when you move from a local development environment to CI.

Dagger Cloud manages the distribution of both layer cache and cache volumes across your CI runners:

  • Layer Cache: The Dagger Engine downloads layers as needed during pipeline execution, so steps whose layers haven’t changed can be served from the cache instead of being re-executed.
  • Cache Volumes: These are downloaded at the beginning of each run and uploaded back to Dagger Cloud at the end, allowing data to persist between pipeline executions, even on ephemeral CI runners.
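
In practice, wiring Dagger Cloud into a CI job mostly comes down to exposing a token to the Dagger Engine. Here is a rough sketch of what that looks like in a GitHub Actions workflow, assuming the Dagger CLI is already installed on the runner and that the module exposes a build function (the job, step, and secret names are illustrative):

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # With the token set, the Dagger Engine reports to (and caches through) Dagger Cloud.
      - name: Build
        run: dagger call build
        env:
          DAGGER_CLOUD_TOKEN: ${{ secrets.DAGGER_CLOUD_TOKEN }}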

Getting the most out of Dagger’s caching system requires understanding a few key concepts.

Optimizing the layer cache is similar to structuring a Dockerfile; the order of operations matters. Instructions that are less likely to change should be placed earlier in the pipeline so their layers can be cached effectively. For example, installing packages or dependencies should occur before steps that involve mounting files that frequently change, such as application source code. This ensures that these more stable layers are reused as often as possible.

Here is a simple example from our CI pipeline:

return dag.Container().
  From("alpine").
  // Install OS packages first: this layer rarely changes, so it is almost always served from the layer cache.
  WithExec([]string{"apk", "add", "--update", "--no-cache", "ca-certificates", "tzdata", "bash"}).
  WithLabel("org.opencontainers.image.title", "openmeter").
  // ...
  // Add the freshly built binary: from this step onward, the cache is invalidated whenever the binary changes.
  WithFile("/usr/local/bin/openmeter", m.Binary().api(platform, version)).
  // ...
  // The creation timestamp changes on every run, so it goes last to avoid busting earlier layers.
  WithLabel("org.opencontainers.image.created", time.Now().String())

By moving the least frequently changing steps (e.g. installing packages) to the top, you can make sure those steps come from the layer cache whenever possible. More frequently changing steps (like embedding the current timestamp into an image) should go at the bottom.

The strategy behind cache volumes is similar to managing cache mounts in a Dockerfile. You need to ensure that frequently reused data is stored in cache volumes so it can be quickly retrieved across different pipeline runs. Properly managing what gets stored in these volumes, and when it is accessed, can dramatically improve your pipeline’s efficiency. In our case, properly configuring cache volumes for the Go module and build caches resulted in a significant decrease in build time.

Here is an example from our pipeline specifying Go module and build cache volumes:

return dag.Go().
  // Persist downloaded Go modules (the module cache) across runs.
  WithModuleCache(dag.CacheVolume("go-mod")).
  // Persist compiled build artifacts (the Go build cache) across runs.
  WithBuildCache(dag.CacheVolume("go-build")).
  // ...
  Build()

Our pipelines utilize both types of caching. It took some time for us to figure out the best way to organize our pipeline to maximize the benefits of caching, but with careful optimization, we successfully reduced build times (on the larger GitHub Actions runners) from 25 to 10 minutes.

A word of caution: Caching is technically still an experimental feature of Dagger Cloud. While it had a few rough edges, it has worked reliably for us over the last few months.

Custom GitHub Actions Runners

Although achieving a 2.5x improvement in build time with Dagger Cloud’s caching was great, we knew there was still room for improvement. As mentioned in the previous section, the beefier GitHub Actions runners were not as effective as we hoped and were more expensive than running custom runners would be. This led us to explore other options that could provide even greater performance gains at a lower cost.

Our ideal solution was a set of ephemeral, automatically scaled runners that could handle our CI workloads efficiently. The state-of-the-art solution for this is ARC (Actions Runner Controller), a Kubernetes-based system that aligns perfectly with what we had in mind. ARC allows for creating ephemeral runners that automatically scale based on demand, providing the flexibility and efficiency we were looking for.
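
For a sense of what that would have involved, here is a rough, illustrative sketch of an ARC configuration using the original summerwind-style CRDs: a runner deployment paired with an autoscaler that scales it with demand. The names, organization, and thresholds below are placeholders, since we never ended up deploying this ourselves.

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners
spec:
  template:
    spec:
      organization: my-org
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: ci-runners-autoscaler
spec:
  scaleTargetRef:
    name: ci-runners
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"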

Incidentally, just as we were exploring potential solutions, the Dagger team published a post detailing their CI setup, which also uses ARC and integrates seamlessly with Dagger. This gave us confidence that ARC could be a viable solution for our needs.

We were preparing to create our own PoC based on ARC when we came across Depot.

Depot is everything we were looking for, minus the need to build a custom solution. Adding it to our workflows was as easy as replacing the runners:

jobs:
  build:
    name: Build
-    runs-on: ubuntu-latest-large
+    runs-on: depot-ubuntu-latest-8

Depot manages custom runners that are significantly more efficient than the ones GitHub offers. During our testing phase, we compared the performance of our larger GitHub Actions runners to several Depot runners of different sizes. Here are the results:

CPU cores  Provider  Avg duration  P95 duration  Price (per minute)
4          Depot     8.82 min      9.33 min      $0.008
8          GitHub    7.97 min      8.61 min      $0.032
8          Depot     5.55 min      6.51 min      $0.016
16         Depot     4.45 min      5.44 min      $0.032
32         Depot     4.28 min      4.82 min      $0.064

(Note: These workflow runs already utilized Dagger Cloud caching.)

The numbers are clear: Depot’s runners with the same (or similar) specifications outperformed GitHub’s runners at half the price. For the cost of an 8 CPU runner on GitHub, Depot provided us with a 16 CPU runner and delivered roughly a 2x improvement in build times. This is the type of runner we are currently using. As a result, we’ve not only managed to cut our build times in half again but also reduced our CI costs by approximately 50%. Not to mention the engineering hours we saved by not having to build our own solution.

Summary

The combination of Dagger Cloud and Depot allowed us to dramatically accelerate our CI pipeline, cutting build times from 25 minutes to just 5 minutes while also reducing our costs by 50%. This powerful duo provided a seamless, efficient, and cost-effective solution, enabling us to focus on delivering high-quality software faster and more reliably.