Benchmarks for Nix CI build times across different CI platforms
We picked some open source projects with lots of stars that already have Nix set up, forked them, and ran several Nix CIs on the same commits, measuring how long CI took. Because Nix is Nix, these CIs did broadly the same work, which makes the comparisons more meaningful than they would be across wholly different stacks.
We ran the following setups:
(1) GitHub Actions, in serial and parallel, without any caching.
(2) GitHub Actions in parallel with magic-nix-cache for caching.
(3) GitHub Actions in parallel with Cachix for caching.
(4) GitHub Actions with nixbuild.net for building (using the CI workflow, with remote store building).
(5) garnix (without incremental builds).
These setups span the range of using GitHub Actions for everything (1-2), to
including external caches (3), to including external builds (4), to
having external evaluators too, and not using GitHub runners for anything (5).
We picked more popular repos that already had largely working Nix builds. For a start, we also focused on:
x86_64-linux builds. All CIs tested also support aarch64-linux. nixbuild.net does not support aarch64-darwin and x86_64-darwin;
Flakes. garnix in the non-enterprise plan only supports flakes;
Builds that do not need a GPU. Only GitHub currently supports runners with GPUs;
Tests without virtualization (e.g., no NixOS tests). garnix and GitHub support virtualization; nixbuild.net seems to offer it as Early Access on x86_64-linux, though it seems to have been in Early Access since 2021.
In the future, we might add different test types, excluding from each the CIs
that don't support the required features.
It's useful to understand the methodology of these benchmarks to
interpret them correctly. If you are impatient, however, you can
skip to the results.
Methodology
We wrote a script that:
Takes as an argument a repo to be tested. This repo must have a flake file.
Gets the last 10 commits in that repo. For each of those commits:
It checks the commit out, and finds the derivations to build. This only includes x86_64-linux builds.
It makes the changes necessary to set up that type of CI (deleting the existing .github and creating a new one, for example).
Pushes that changed commit to a new branch in a separate repo.
Waits for the check suite to finish, or times out after 2 hours.
Records the timing GitHub gives us for the check suite.
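The "wait or time out" step above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark script: get_status is a hypothetical callback that returns "completed" once the check suite has finished, and the 30-second polling interval is an assumption. Only the 2-hour limit comes from the description above.

```python
import time
from typing import Callable

TIMEOUT_SECONDS = 2 * 60 * 60  # the 2-hour limit described above


def wait_for_check_suite(
    get_status: Callable[[], str],
    timeout_seconds: float = TIMEOUT_SECONDS,
    poll_interval_seconds: float = 30.0,  # assumed value, not from the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the check suite completed, False if we gave up waiting.

    `get_status` is a hypothetical callback; any value other than
    "completed" is treated as "still running".
    """
    start = clock()
    while True:
        if get_status() == "completed":
            return True
        if clock() - start >= timeout_seconds:
            return False
        sleep(poll_interval_seconds)
```

Note that a False result only means the script stopped watching; the CI run itself may still finish later, which matters when interpreting timed-out runs.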
You can see all the GitHub Actions workflow runs (for 1-4) here and the garnix logs here.
Note that the packages and checks that we checked may differ from
the ones that are enabled on the repo's own CI.
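As an illustration of how the x86_64-linux packages and checks might be selected, here is a small sketch. It assumes the flake outputs are available as a nested dictionary of the form kind -> system -> name (roughly the shape of JSON output from nix flake show); the function name and the exact input shape are assumptions, not the benchmark script's actual code.

```python
from typing import Any, Dict, List

SYSTEM = "x86_64-linux"
OUTPUT_KINDS = ("packages", "checks")  # the output kinds mentioned above


def buildable_attrs(flake_outputs: Dict[str, Any], system: str = SYSTEM) -> List[str]:
    """List attribute paths such as 'packages.x86_64-linux.default'.

    Assumes `flake_outputs` maps kind -> system -> name -> metadata.
    Other systems and other output kinds (e.g. devShells) are ignored.
    """
    attrs = []
    for kind in OUTPUT_KINDS:
        per_system = flake_outputs.get(kind, {}).get(system, {})
        for name in sorted(per_system):
            attrs.append(f"{kind}.{system}.{name}")
    return attrs
```

Each resulting path could then be built with something like nix build .#<path>, though how the real script invokes the builds depends on the CI being set up.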
For configuration:
We used the default GitHub Linux runner. GitHub Actions offers runners with up to 96 cores, at a correspondingly higher price. For setups 1-3, larger runners would presumably speed up builds substantially; for 4, presumably less so. Using them would drive up the cost of benchmarking by quite a lot, however.
Our Cachix cache was 50GB in size.
nixbuild.net was left on the default configuration, but the CI Workflow (with remote stores) was used.
Running it yourself
If you'd like to try it out yourself, you can follow these steps:
Add a token for GitHub API calls (with e.g. gh auth login);
From inside your clone, run nix run .#benchmark -- <REPO> (for example, nix run .#benchmark -- atuinsh/atuin).
Important notes:
It will likely cost you some money.
It will take a long time, especially if you include the slower options
(some of these repos took a full day to check!).
Check that all CI systems are succeeding and failing the same tests before
looking at the timings! Sometimes CIs will finish quicker than they
should because they are failing when they shouldn't. And some setups
are fail-on-first-error (e.g. GitHub Actions in serial); most aren't.
garnix has a global cache. This means that (for public repos at least) if anyone built a particular commit, it'll likely be cached. For fairness, then, don't test garnix on repos that already use garnix, and don't retest on the same commits you've already tested.
Make sure benchmarks don't affect one another. For example, if you run two different types of CI that write to the same cache (Cachix, garnix, magic-nix-cache, and nixbuild.net), you will get artificially low timings.
Queuing is accounted for differently between garnix and the CIs using GitHub Actions. The amount of time spent in a queue is not counted in the total time for GitHub Actions, but it is for garnix. It's somewhat rare to be queued with garnix, but it can happen if you run multiple tests in parallel.
The commits picked were the last ten commits at the time
the benchmark started. You can see the individual commit hashes by
hovering over a datapoint.
Note that by default we exclude the first commit from any calculations.
This is for two reasons:
The first commit is susceptible to various influences: in
particular, with any CI that has a cache, to whether related
builds were ever added to the cache. It's also, for
setups with a cache, on average much slower than others.
For your average project, there will be many more commits
than just ten. Letting the first commit contribute so much to
the average would therefore be unrepresentative.
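To make the effect of this exclusion concrete, here is a minimal sketch with made-up numbers. It assumes the timings are listed in the order the commits were benchmarked, so that the first entry is the cold-cache run; the function name and the data are hypothetical.

```python
from statistics import mean
from typing import List


def average_excluding_first(timings_minutes: List[float]) -> float:
    """Average the timings, skipping the first (cold-cache) commit.

    Assumes the list is ordered as the commits were benchmarked.
    """
    if len(timings_minutes) < 2:
        raise ValueError("need at least two commits to exclude the first")
    return mean(timings_minutes[1:])


# Hypothetical timings: the first commit has to populate the cache.
example = [30.0, 6.0, 4.0, 5.0]
print(mean(example))                     # 11.25, dominated by the first commit
print(average_excluding_first(example))  # 5.0
```

With only a handful of commits, a single cold-cache run more than doubles the average in this made-up example, which is why leaving it in would be unrepresentative.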
Some notes:
Some of the agda/agda builds were (correctly) failing
on all CIs. But GitHub Actions Serial fails on first error,
and so does not report all errors and all successes. We
therefore added a > sign in the relevant averages.
GitHub Actions Serial also incorrectly failed on
all crytic/echidna builds, and was therefore excluded
from those graphs.
GitHub Actions Parallel and nixbuild.net timed out in all
or some (respectively) of their crytic/echidna builds.
Our code stopped checking the timings after 2 hours, and
so we labelled them with 2 hours and used that to calculate
the average. But this is an underestimate: they did in
fact finish, on average taking around 4 hours. We therefore
again put a > sign in the relevant averages.
garnix timed out during evaluation of two of the builds of
the first crytic/echidna commit. This is a different
type of timeout than above, set in the CI instead of in
the benchmark script. Because of this, the builds did
not in fact continue. It was our mistake not to configure
this limit to be higher, which would have made the results more comparable.
It's unclear how best to fairly account for all these cases.
We adopted the same attitude as above, considering this
a lower bound.
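The lower-bound treatment described in these notes can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the graphs: a timing of None stands for a run that did not finish within the limit, it is counted as the full 2 hours, and the resulting average is marked with a > sign. The function name and output format are made up.

```python
from statistics import mean
from typing import List, Optional

TIMEOUT_MINUTES = 120.0  # the 2-hour limit of the benchmark script


def format_average(timings_minutes: List[Optional[float]]) -> str:
    """Average timings, counting unfinished runs (None) as the timeout.

    If any run was capped, the true average is at least this large,
    so the result is prefixed with '>'.
    """
    capped = [TIMEOUT_MINUTES if t is None else t for t in timings_minutes]
    prefix = ">" if any(t is None for t in timings_minutes) else ""
    return f"{prefix}{mean(capped):.1f} min"


print(format_average([60.0, None, 30.0]))  # >70.0 min
print(format_average([10.0, 20.0]))        # 15.0 min
```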
We can also summarize the average slowdown for each CI,
relative to the fastest one. We calculated that by, for each
repo, normalizing the average of each CI by the fastest, and
then averaging those numbers.
Note that here we still included the timeouts and early
failures. Therefore, the average for GitHub Actions Serial
when agda/agda is included, and for GitHub Actions Parallel and nixbuild.net
when crytic/echidna is included, should be thought of as a
lower bound.
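The slowdown calculation described above can be written down as a short sketch. The repo and CI names and all numbers below are hypothetical; the input is assumed to be a mapping from repo to the per-CI average build time.

```python
from statistics import mean
from typing import Dict, List


def relative_slowdown(averages: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map each CI to its mean slowdown relative to the fastest CI per repo.

    `averages[repo][ci]` is the average build time of `ci` on `repo`.
    A CI missing from a repo is averaged only over the repos it has data for.
    """
    ratios: Dict[str, List[float]] = {}
    for by_ci in averages.values():
        fastest = min(by_ci.values())
        for ci, avg in by_ci.items():
            # 1.0 means "fastest on this repo"; 2.0 means twice as slow.
            ratios.setdefault(ci, []).append(avg / fastest)
    return {ci: mean(values) for ci, values in ratios.items()}


# Hypothetical averages in minutes.
example = {
    "repo-a": {"ci-x": 10.0, "ci-y": 20.0},
    "repo-b": {"ci-x": 15.0, "ci-y": 60.0},
}
print(relative_slowdown(example))  # {'ci-x': 1.0, 'ci-y': 3.0}
```

Normalizing per repo before averaging keeps one very slow repository from dominating the summary, at the cost of giving every repository equal weight regardless of its size.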
Analysis
A few facts stand out:
magic-nix-cache doesn't seem to help at all with build speeds
over having no cache (i.e., over github-actions-parallel).
This is surprising since Cachix, which also only differs in having a cache,
speeds up builds considerably, and since the project's
README
claims a big speedup (which presumably at least some people are seeing
if they're using it).
Part of the reason seems to be slow uploads to the cache (example here).
It could be that serial GitHub Actions builds with magic-nix-cache
are faster than serial GitHub Actions builds without, since downloads
can be kicked off early while waiting on CPU-intensive work.
It's not clear why several CIs choked so badly on crytic/echidna.
The issue seems to always have been the package echidna-redistributable,
but sometimes it was seemingly
very slow uploads, and sometimes
slow builds (click on the Settings button and then "Show Timestamp" to see timestamps).
Disk space also seems to have been an issue.
It's an outlier that deserves more careful examination, though it
does not seem to be a fluke — it happened consistently on
the same package across CIs, and across tests many hours apart,
even as other packages built successfully.
garnix performed best across all repos.
Future improvements
There are a few benchmarks missing:
magic-nix-cache with FlakeHub Cache (only available for private repos);
garnix with incremental builds (faster for most compiled languages, but involves some manual, per-repo work);
nixbuild.net without remote store building (slower, but more stable — though we didn't notice any instabilities).
Different runner sizes for GitHub Actions.
ARM Linux and Macs. nixbuild.net doesn't support Mac builds yet, and benchmarking the others may get expensive.
Ideally, we would also be more systematic in what repositories we check. Most starred repositories that are not starter templates or documentation, with a flake.nix which builds a substantial part of the project (instead of e.g. just devshells) might be a good criterion. Unfortunately, we couldn't figure out how to get GitHub search to accurately show repos, ordered by stars, with a flake.nix.
It was probably a mistake to let GitHub Actions (serial) fail on first
error, since it means it did less work than other CIs. If we were to rerun
this, we would change that.