Nix CI Benchmarks

Benchmarks for Nix CI build times across different CI platforms

We picked some popular open source projects that already had Nix set up, forked them, and ran several different Nix CIs on the same commits, measuring how long CI took. Because Nix is Nix, these CIs did broadly the same work, which makes the comparisons more meaningful than they would be across wholly different stacks.

We ran the following setups:

(1) GitHub Actions, in serial and parallel, without any caching.
(2) GitHub Actions in parallel with magic-nix-cache for caching.
(3) GitHub Actions in parallel with Cachix for caching.
(4) GitHub Actions with nixbuild.net for building (using the CI workflow, with remote store building).
(5) garnix (without incremental builds).

These setups span the range of using GitHub Actions for everything (1-2), to including external caches (3), to including external builds (4), to having external evaluators too, and not using GitHub runners for anything (5).

We picked popular repos that already had largely working Nix builds. As a starting point, we also restricted the benchmark to:

x86_64-linux builds. All CIs tested also support aarch64-linux, but nixbuild.net does not support aarch64-darwin or x86_64-darwin;
Flakes. garnix only supports flakes on its non-enterprise plan;
Builds that don't need a GPU. Only GitHub currently offers runners with GPUs;
Tests without virtualization (e.g., no NixOS tests). garnix and GitHub support virtualized tests; nixbuild.net seems to offer them on x86_64-linux as Early Access, though they seem to have been in Early Access since 2021.

In the future, we might add other test types, leaving out of each the CIs that don't support the required features.

This benchmark, as well as this text, was created by people at garnix, one of the CIs benchmarked. This may influence the results in ways that favor garnix. For example, there is publication bias: we probably wouldn't have created this benchmark if we didn't think we were faster than the alternatives, and we might not have published it if that assumption had turned out to be wrong.

Besides reading the code to see what it is doing, consider running it yourself, with modifications, and against different repos. If your conclusions differ from the text or results here, open a PR against our benchmark repo! And you can always enable these different CIs on your own repo to see exactly how they perform there.

It's useful to understand the methodology of these benchmarks in order to interpret them correctly. If you're impatient, however, you can skip to the results.

Methodology

We wrote a script that:

Takes as an argument a repo to be tested. This repo must have a flake file.
Gets the last 10 commits in that repo. For each of those commits:
- It checks the commit out, and finds the derivations to build. This only includes x86_64-linux builds.
- For each of the CI setups benchmarked:
  - It makes the changes necessary to set up that type of CI (deleting the existing .github and creating a new one, for example).
  - Pushes that changed commit to a new branch in a separate repo.
  - Waits for the check suite to finish, or times out after 2 hours.
  - Records the timing GitHub gives us for the check suite.
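The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmarking.ts: the `Deps` interface and all of its method names are hypothetical stand-ins for the git and GitHub API calls, and running the setups one after another is an assumption.

```typescript
// Hypothetical sketch of the benchmark loop. None of these names are taken
// from the real benchmarking.ts.

type CiSetup = string;

interface Timing {
  commit: string;
  setup: CiSetup;
  seconds: number;   // duration of the check suite, or the cap on timeout
  timedOut: boolean; // true if we stopped waiting before the suite finished
}

// Everything that touches git or the GitHub API is injected, so the loop
// itself stays small and can be exercised with fakes.
interface Deps {
  lastCommits(repo: string, n: number): Promise<string[]>;
  // Rewrites the CI config for `setup`, pushes to a new branch in the
  // separate benchmark repo, and returns the branch name.
  pushWithCiConfig(repo: string, commit: string, setup: CiSetup): Promise<string>;
  // Resolves with the check-suite duration in seconds, or null on timeout.
  waitForCheckSuite(branch: string, timeoutSeconds: number): Promise<number | null>;
}

const TIMEOUT_SECONDS = 2 * 60 * 60; // the 2-hour cap described above

async function benchmark(repo: string, setups: CiSetup[], deps: Deps): Promise<Timing[]> {
  const timings: Timing[] = [];
  for (const commit of await deps.lastCommits(repo, 10)) {
    for (const setup of setups) {
      const branch = await deps.pushWithCiConfig(repo, commit, setup);
      const seconds = await deps.waitForCheckSuite(branch, TIMEOUT_SECONDS);
      timings.push({
        commit,
        setup,
        seconds: seconds ?? TIMEOUT_SECONDS, // a timeout is recorded as the cap
        timedOut: seconds === null,
      });
    }
  }
  return timings;
}
```

Recording a timeout as the cap rather than dropping the run keeps slow setups in the averages, at the cost of underestimating them.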

You can see all the GitHub Actions workflow runs (for 1-4) here and the garnix logs here.

Note that the packages and checks that we built may differ from the ones that are enabled on the repo's own CI.

For configuration:

We used the default GitHub Linux runner. GitHub Actions offers runners with up to 96 cores, at a correspondingly higher price. For setups 1-3, larger runners would presumably speed up builds substantially; for 4, presumably less so. Using them would drive up the cost of benchmarking by quite a lot, however.
Our Cachix cache was 50GB in size.
nixbuild.net was left on the default configuration, but the CI Workflow (with remote stores) was used.

Running it yourself

If you'd like to try it out yourself, you can follow these steps:

Fork the benchmark-github and benchmark-garnix repos;
Change the references in benchmarking.ts from garnix-io/benchmark-github and garnix-io/benchmark-garnix to your new forked repo names;
Add your own CACHIX_AUTH_TOKEN and NIXBUILD_NET_TOKEN to your GitHub repository secrets;
Enable garnix on your fork of benchmark-garnix;
Clone the benchmark repo;
Add a token for GitHub API calls (with e.g. gh auth login);
From inside your clone, run nix run .#benchmark -- <REPO> (for example, nix run .#benchmark -- atuinsh/atuin).

Important notes:

It will likely cost you some money.
It will take a long time, especially if you include the slower options (some of these repos took a full day to check!).
Check that all CI systems are succeeding and failing the same tests before looking at the timings! Sometimes CIs will finish quicker than they should because they are failing when they shouldn't. And some setups are fail-on-first-error (e.g. GitHub Actions in serial); most aren't.
garnix has a global cache. This means that (for public repos at least) if anyone built a particular commit, it'll likely be cached. For fairness, then, don't test garnix on repos that already use garnix, and don't retest on the same commits you've already tested.
Make sure benchmarks don't affect one another. For example, if you run two different types of CI that write to the same cache (Cachix, garnix, magic-nix-cache, or nixbuild.net), you will get artificially low timings.
Queue time is accounted for differently between garnix and the CIs using GitHub Actions. Time spent in a queue is not counted in the total time for GitHub Actions, but it is for garnix. You will rarely be queued with garnix, but it can happen if you run multiple tests in parallel.

The results

We ran the benchmarks for three repos:

The commits picked were the last ten commits at the time the benchmark started. You can see the individual commit hashes by hovering on a datapoint.

Note that by default we exclude the first commit from any calculations. This is for two reasons:

The first commit is susceptible to various influences: in particular, with any CI that has a cache, to whether related builds were ever added to the cache. For setups with a cache, it is also much slower on average than the other commits.
A typical project has many more than ten commits. Letting the first commit contribute so much to the average would therefore be unrepresentative.
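As a minimal sketch of that default, assuming the timings for one CI setup are held in an array in the order the commits were benchmarked (the function name and the numbers below are made up for illustration and are not benchmark results):

```typescript
// Mean CI time for one setup, skipping the first benchmarked commit unless
// `includeFirst` is set. Hypothetical helper, not part of the benchmark repo.
function meanSeconds(timesInRunOrder: number[], includeFirst = false): number {
  const used = includeFirst ? timesInRunOrder : timesInRunOrder.slice(1);
  if (used.length === 0) throw new Error("no timings to average");
  return used.reduce((sum, t) => sum + t, 0) / used.length;
}

// Made-up timings: a cold-cache first commit followed by warm ones.
const example = [1200, 300, 320, 280];
console.log(meanSeconds(example));       // 300
console.log(meanSeconds(example, true)); // 525
```

In this made-up case the single cold-cache run nearly doubles the average, which is the distortion the default is meant to avoid.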


Some notes:

Some of the agda/agda builds were (correctly) failing on all CIs. But GitHub Actions Serial fails on first error, and so does not report all errors and all successes. We therefore added a > sign in the relevant averages.
GitHub Actions Serial also incorrectly failed on all crytic/echidna builds, and was therefore excluded from those graphs.
GitHub Actions Parallel and nixbuild.net timed out in all or some (respectively) of their crytic/echidna builds. Our code stopped checking the timings after 2 hours, and so we labelled them with 2 hours and used that to calculate the average. But this is an underestimate: they did in fact finish, on average taking around 4 hours. We therefore again put a > sign in the relevant averages.
garnix timed out during evaluation of two of the builds of the first crytic/echidna commit. This is a different type of timeout than above, set in the CI instead of in the benchmark script. Because of this, the builds did in fact not continue. It was our mistake not to configure this limit to be higher so that the results were more comparable. It's unclear how best to fairly account for all these cases. We adopted the same attitude as above, considering this a lower bound.
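One way to carry that lower-bound marker through to the displayed average is sketched below. The `Run` shape and the function name are hypothetical, and the sketch assumes that a timed-out run is stored with the cap as its duration.

```typescript
// Hypothetical record for a single CI run.
interface Run {
  seconds: number;     // measured duration, or the cap if we stopped measuring
  lowerBound: boolean; // true for timeouts and for fail-on-first-error runs
}

// Formats the mean in whole minutes, prefixed with ">" when any contributing
// run is only a lower bound on the real duration.
function formatAverageMinutes(runs: Run[]): string {
  if (runs.length === 0) throw new Error("no runs to average");
  const mean = runs.reduce((sum, r) => sum + r.seconds, 0) / runs.length;
  const anyLowerBound = runs.some((r) => r.lowerBound);
  return `${anyLowerBound ? ">" : ""}${Math.round(mean / 60)} min`;
}

// Made-up example: one run finished in an hour, two hit the 2-hour cap.
console.log(
  formatAverageMinutes([
    { seconds: 3600, lowerBound: false },
    { seconds: 7200, lowerBound: true },
    { seconds: 7200, lowerBound: true },
  ])
); // ">100 min"
```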

We can also summarize the average slowdown for each CI, relative to the fastest one. We calculated that by, for each repo, normalizing the average of each CI by the fastest, and then averaging those numbers.

Note that here we still included the timeouts and early failures. Therefore, the average for GitHub Actions Serial when agda/agda is included, and for GitHub Actions Parallel and nixbuild.net when crytic/echidna is included, should be thought of as a lower bound.
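The slowdown calculation can be sketched as below. The function and type names are hypothetical, the CI names and numbers are made up rather than taken from the results, and averaging a CI only over the repos where it has data is an assumption.

```typescript
// Average time in seconds per CI, for one repo.
type RepoAverages = { [ci: string]: number };

// For each repo, divide every CI's average by that repo's fastest average,
// then average those ratios per CI across the repos where the CI appears.
function averageSlowdown(perRepo: RepoAverages[]): { [ci: string]: number } {
  const sums: { [ci: string]: number } = {};
  const counts: { [ci: string]: number } = {};
  for (const averages of perRepo) {
    const cis = Object.keys(averages);
    const fastest = Math.min(...cis.map((ci) => averages[ci]));
    for (const ci of cis) {
      sums[ci] = (sums[ci] ?? 0) + averages[ci] / fastest;
      counts[ci] = (counts[ci] ?? 0) + 1;
    }
  }
  const result: { [ci: string]: number } = {};
  for (const ci of Object.keys(sums)) {
    result[ci] = sums[ci] / counts[ci];
  }
  return result;
}

// Made-up averages for two repos and two CIs.
console.log(
  averageSlowdown([
    { ciA: 100, ciB: 200 }, // ciB is 2x the fastest here
    { ciA: 120, ciB: 360 }, // ciB is 3x the fastest here
  ])
); // { ciA: 1, ciB: 2.5 }
```

Normalizing per repo before averaging keeps a single very slow repo from dominating the summary, since each repo contributes a ratio rather than raw seconds.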


Analysis

A few facts stand out:

magic-nix-cache doesn't seem to improve build speeds at all compared to having no cache (i.e., compared to github-actions-parallel). This is surprising, since Cachix, which also differs only in having a cache, speeds up builds considerably, and since the project's README claims a big speedup (which presumably at least some of its users are seeing). Part of the reason seems to be slow uploads to the cache (example here). It could be that serial GitHub Actions builds with magic-nix-cache are faster than serial GitHub Actions builds without, since downloads can be kicked off early while waiting on CPU-intensive work.
It's not clear why several CIs choked so badly on crytic/echidna. The culprit always seems to have been the package echidna-redistributable, but sometimes the problem was seemingly very slow uploads, and sometimes slow builds (click on the Settings button and then "Show Timestamp" to see timestamps). Disk space also seems to have been an issue. It's an outlier that deserves more careful examination, though it does not seem to be a fluke: it happened consistently on the same package across CIs, and across tests many hours apart, even as other packages built successfully.
garnix performed best across all repos.

Future improvements

There are a few benchmarks missing:

magic-nix-cache with FlakeHub Cache (only available for private repos);
garnix with incremental builds (faster for most compiled languages, but involves some manual, per-repo work);
nixbuild.net without remote store building (slower, but more stable, though we didn't notice any instabilities);
Different runner sizes for GitHub Actions;
ARM Linux and Macs. nixbuild.net doesn't support Mac builds yet, and benchmarking the other CIs on these platforms may get expensive.

Ideally, we would also be more systematic about which repositories we check. A good criterion might be the most-starred repositories that are not starter templates or documentation, and that have a flake.nix which builds a substantial part of the project (rather than, e.g., just devshells). Unfortunately, we couldn't figure out how to get GitHub search to accurately show repos with a flake.nix, ordered by stars.

It was probably a mistake to let GitHub Actions (serial) fail on first error, since it means it did less work than other CIs. If we were to rerun this, we would change that.


The code

You can read, fork, or contribute to the benchmark repo here.