Reproducible C++ builds by logging Git hashes

(jgarby.uk)

39 points | j4cobgarby | 6mo ago | 43 comments

danudey6mo ago

A simpler way to do this, especially if you do tagging in your repositories, is to use `git describe`. For example:

    $ git describe --dirty
    v1.4.1-1-gde18fe90-dirty

The format is <the most recent tag>-<the number of commits since that tag>-g<the short git hash>-<dirty, but only if the repo is dirty>.

If the repo isn't dirty, then the output you get omits that suffix:

    $ git describe --dirty
    v1.4.1-1-gde18fe90
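If a program needs to take that string apart again (say, to show the tag and the dirty flag separately), the format can be parsed from the right. Below is a minimal C++11 sketch; `DescribeInfo` and `parse_describe` are hypothetical names made up for illustration, and the code assumes the `<tag>-<n>-g<hash>[-dirty]` shape described above. A bare tag or the hash-only output of `--always` is reported as "not parsed" rather than guessed at.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Fields of a `git describe --dirty` string such as "v1.4.1-1-gde18fe90-dirty".
struct DescribeInfo {
    std::string tag;         // most recent reachable tag, e.g. "v1.4.1"
    int commits_since = 0;   // number of commits made after that tag
    std::string short_hash;  // abbreviated hash without the leading 'g'
    bool dirty = false;      // true if the "-dirty" suffix was present
};

// Parses from the right, because tag names may themselves contain hyphens.
// Returns false for anything not shaped like <tag>-<n>-g<hash>[-dirty].
bool parse_describe(std::string s, DescribeInfo& out) {
    DescribeInfo info;
    const std::string suffix = "-dirty";
    if (s.size() > suffix.size() &&
        s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0) {
        info.dirty = true;
        s.erase(s.size() - suffix.size());
    }

    // Last component: 'g' followed by hexadecimal digits.
    const std::string::size_type hash_pos = s.rfind("-g");
    if (hash_pos == std::string::npos || hash_pos == 0) return false;
    info.short_hash = s.substr(hash_pos + 2);
    if (info.short_hash.empty()) return false;
    for (unsigned char c : info.short_hash)
        if (!std::isxdigit(c)) return false;

    // Second-to-last component: decimal commit count.
    const std::string::size_type count_pos = s.rfind('-', hash_pos - 1);
    if (count_pos == std::string::npos || count_pos == 0) return false;
    const std::string count = s.substr(count_pos + 1, hash_pos - count_pos - 1);
    if (count.empty() || count.size() > 9) return false;
    for (unsigned char c : count)
        if (!std::isdigit(c)) return false;
    info.commits_since = std::stoi(count);

    // Everything before the count is the tag name.
    info.tag = s.substr(0, count_pos);
    out = info;
    return true;
}
```

One known limitation of this sketch: a tag whose own name ends in something like `-3-gabc`, described while sitting exactly on that tag, would be misread, since the format itself is ambiguous in that case.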

If you're using lightweight tags (the default) and not annotated tags (with messages, signatures, etc.) you may want to add `--tags`, because otherwise it'll skip over any lightweight tags.

The other nice thing about this is that, if the repo is not -dirty, you can use the output from `git describe` in other git commands to reference that commit:

    $ git show -s v1.4.1-1-gde18fe90
    commit de18fe907edda2f2854e9813fcfbda9df902d8f1 (HEAD -> 1.4.1-release, origin/HEAD, origin/1.4.1-release)
    Author: rockowitz <rockowitz@minsoft.com>
    Date:   Sun May 28 17:09:46 2023 -0400

        Create codacy.yml

WorldMaker6mo ago

`git describe` is great.

Also, if you don't feel ready to commit to tagging your repository you can start with the `--always` flag which falls back to just the short commit hash.

The article's script isn't far from `git describe --always --dirty`, which can be a good place to start, and then it gets better as you start tagging.

o11c6mo ago

The one caveat to this is that you must perform a sufficiently-deep clone that you can actually reach the tag.

halayli6mo ago

That barely scratches the surface when it comes to reproducible C and C++ builds. In fact, the topic of reproducible builds assumes your sources are the same; that's really not the problem here.

You need to control every single library header version you are using outside your source like stdlibs, os headers, third party, and have a strategy to deal with rand/datetime variables that can be part of the binary.
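For the datetime part, one widely used convention is the `SOURCE_DATE_EPOCH` environment variable from the Reproducible Builds project: a build step that has to embed a timestamp takes it from there instead of from the wall clock. A minimal C++11 sketch of that idea follows; `parse_epoch` and `embedded_timestamp` are hypothetical helper names, not part of any library, and the strictness of the parsing is an assumption of this example.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>

// Parses a SOURCE_DATE_EPOCH-style value: decimal seconds since the Unix epoch.
// Returns false for null, empty, non-numeric, or out-of-range input.
bool parse_epoch(const char* text, long long& out) {
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text)))
        return false;
    errno = 0;
    char* end = nullptr;
    const long long value = std::strtoll(text, &end, 10);
    if (errno != 0 || *end != '\0') return false;  // overflow or trailing junk
    out = value;
    return true;
}

// Timestamp a build step should embed: the pinned value when the environment
// provides a valid one, otherwise `fallback` (for example the current time).
long long embedded_timestamp(long long fallback) {
    long long pinned = 0;
    return parse_epoch(std::getenv("SOURCE_DATE_EPOCH"), pinned) ? pinned
                                                                  : fallback;
}
```

A code generator or packaging tool would call `embedded_timestamp(std::time(nullptr))` (with `<ctime>` included), so two builds run with the same `SOURCE_DATE_EPOCH` embed the same value.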

hogehoge516mo ago

You also need to capture the version of the toolchain etc etc. Should also have a traceable link to the version of your specifications.

Just use ClearCase/ClearMake, it's been doing all of this software configuration auditing stuff for you since the 1990s.

WalterBright6mo ago

Also the compiler/linker used to build it.

matrss6mo ago

As well as the toolchain used to compile your toolchain, through multiple levels, and all compiler flags along the path, and so on, down to some "seed" from which everything is built.

Guix' full-source bootstrap is pretty enlightening on that topic: https://guix.gnu.org/manual/devel/en/html_node/Full_002dSour...

YayaScript6mo ago

How would you even start solving these?

syncsynchalt6mo ago

Take a look at the decade+ long effort that Debian has put into this problem: https://wiki.debian.org/ReproducibleBuilds

Here's a talk from 2024: https://debconf24.debconf.org/talks/18-reproducible-builds-t...

Several distros are above the 90% mark of all packages being byte-for-byte reproducible, and one or two have hit the 99% mark.

ignoramous6mo ago

> Several distros are above the 90% mark of all packages being byte-for-byte reproducible, and one or two have hit the 99% mark.

Simply incredible.

Explains F-Droid's recent success with Reproducible Builds (as some F-Droid maintainers are also active in the Debian scene): https://f-droid.org/en/2025/05/21/making-reproducible-builds...

matrss6mo ago

A good package manager, e.g. GNU Guix, lets you define a reproducible environment of all of your dependencies. This accounts for all of those external headers and shared libraries, which will be made available in an isolated build environment that only contains them and nothing else.

Eliminating nondeterminism from your builds might require some thinking, there are a number of places this can creep in (timestamps, random numbers, nondeterministic execution, ...). A good package manager can at least give you tooling to validate that you have eliminated nondeterminism (e.g. `guix build --check ...`).

Once you control the entire environment and your build is reproducible in principle, you might still encounter some fun issues, like "time traps". Guix has a great blog post about some of these issues and how they mitigate them: https://guix.gnu.org/en/blog/2024/adventures-on-the-quest-fo...

MomsAVoxell6mo ago

Virtualization, imho. Every build gets its own virtual machine, and once the build is released to the public, the VM gets cloned for continued development and the released VM gets archived.

I do this git tags thing with my projects - it helps immensely if the end user can hover over the company logo and get a tooltip with the current version, git tag and hash, and any other relevant information to the build.

Then, if I need to triage something specific, I un-archive the virtualized build environment, and everything that was there in the original build is still there.

This is a very handy method for keeping large code bases under control, and has been very effective over the years in going back to triage new bugs found, fixing them, and so on.

corysama6mo ago

Back in the PS2 era of game development, we didn't have much in the way of virtual machines to work with. And, making a shippable build involved wacky custom hardware that wouldn't work in a VM anyway. So, instead we had The Build Machine.

The Build Machine would be used to make The Gold Master Disc. A physical DVD that would be shipped to the publisher to be reproduced hopefully millions of times. Getting The Gold Master Disc to a shippable state would usually take weeks because it involved burning a custom disc format for each build and there was usually no way to debug other than watching what happened on the game screen.

When The Gold Master Disc was finally finalized, The Build Machine would be powered down, unplugged, labeled "This is the machine that made The Gold Master Disc for Game XYZ. DO NOT DISCARD. Do not power on without express permission from the CTO." and archived in the basement forever. Or, until the company shut down. Then, who knows what happens to it.

But, there was always a chance that the publisher or Sony would come back and request to make a change for 1.0.1 version because of some subtle issue that was found later. You don't want to take any chances starting the build process over on a different machine. You make the minimal changes possible on The Build Machine and you get The Gold Master Disc 1.0.1 out ASAP.

hogehoge516mo ago

AFAIK ClearMake intercepted file system access and recorded the version of everything touched during your build.

chuckadams6mo ago

Give Nix a look sometime, it takes this to a whole new level by including all of the build dependencies in the hash, and their build dependencies and so on. The standard flake workflow even includes the warning about having uncommitted files.

ikety6mo ago

It's quite odd to me that Nix or something similar like Mise isn't completely ubiquitous in software. I feel like I went from having issues with build dependencies to having that aspect of software development completely solved as soon as I adopted Nix.

I absolutely can't imagine not using some kind of tool like this. Feels as vital as VCS to me now.

chuckadams6mo ago

We'd have been a lot further along if tools like make had ever adopted hashes for freshness checking rather than timestamps. We'd have ccache built in to make, make could hash entire targets, and now we're halfway to derivations. Of course that's handwaving over the tricky problem of making sure targets build reproducibly, but perhaps compiler toolchains would have taken more care to ensure it.

eptcyka6mo ago

I'd say the sad part is that nix really works well when the toolchain does caching transparently. But to deliver good DX outside of nix, you kind of want great porcelain tooling that handles everything behind the scenes - downloading of libraries, building said libraries, linking everything together. Sometimes people choose to just embed a whole programming language to make their build system work e.g. gradle. Cargo just does everything. Nix then can't really granularly build everything piece by piece when building rust crates with Cargo - you just get to rebuild every dependency any time the derivation is built and any one input changed. I wonder how much less time would've been wasted if newer languages chose to build on top of nix. Of course, nix would need to become slightly more compatible with Windows and other OSes for this to be practical.

bigfishrunning6mo ago

Timestamps have the property of being easily comparable; you can always tell if one file is older than the other. If you were to use hashes for the same purpose, you'd have to keep a database of expected hashes, and comparing them would be a less trivial task, etc. It's doable, but it would be a very differently designed (and much more computationally expensive) program than make.

peterldowns6mo ago

Agreed. Recently started a new gig and set up Mise (previously had used nix for this) in our primary repos so that we can all share dependencies, scripts, etc. The new monorepo mode is great. Basically no one has complained and it's made everyone's lives a lot easier. Can't imagine working any other way — having the same tools everywhere is really great.

I'll also say I have absolutely 0 regrets about moving from Nix to Mise. All the common tools we want are available, it's especially easy to install tools from pip or npm and have the environments automanaged. The docs are infinity times better. And the speed of install and shell sourcing is, you guessed it, much better. Initial setup and install is also fantastically easier. I understand the ideology behind Nix, and if I were working on projects where some of our tools weren't pre-packageable or had weird conflicting runtime lib problems I'd get it, but basically everything these days has prebuilt static binaries available.

chuckadams6mo ago

Mise is pretty nice, I'd recommend it over all the other gazillion version-manager things out there, but it's not without its own weak spots: I tried mise for a php project, neither of the backends available for php had a binary for macos, and both of them failed to build it. I now use a flake.nix, along with direnv and `use flake`. The nix language definitely makes for some baffling boilerplate around the dependencies list, but devs unfamiliar with nix can ignore it and just paste in the package name from nixpkgs search.

There's also jbadeau/mise-nix that lets you use flakes in mise, but I figured at that point I may as well just use flake.nix.

spooky_deep6mo ago

I can say why I bounced off of Nix.

Lots of package combinations didn’t work and I was not skilled enough to figure out why.

The error messages are terrible.

They don’t provide enough versions of packages. I want Python 3.10.4 exactly. But Nix packages by default only provide 3.10.something

I would love to use Nix everywhere, but it’s just too cumbersome for me.

chuckadams6mo ago

If the nix ecosystem moved entirely to flakes, you could just point at the flake in python's repo, pin it to the proper commit hash, and job's done. Might result in a lot of extra near-duplicate dependencies in the store, but that's unlikely to affect you at the level of Python. Otherwise yeah, you're stuck with whatever combinations were blessed by nixpkgs at the time, or with writing your own derivation.

And the error messages are ... well, yeah. I don't find the nix language as awful as some do, but it's still a functional language by and for functional programmers, and being lazy, a lot of errors surface in very non-obvious places. Ultimately Nix could use a declarative config format on top of everything, but I'd rather they ironed out the other issues first. Guix seems to be a bit further along there, but its platform options are more limited.

ikety6mo ago

Tried Mise?

zokier6mo ago

I think bazel is the tool a lot of people are converging towards, but it turns out that maintaining complex build setups is a lot of work.

steeleduncan6mo ago

Yes, especially as you can do things like

  nix run github:user/repo/commit

There is no need to keep anything around, or roll your own nix equivalent, you can just look up the output by commit.

groby_b6mo ago

This is many useful things, but it's far from a reproducible C++ build. That'd require you to ensure bit-for-bit identical builds when you reproduce, and logging the repository state is just a tiny first step to get there.

https://nikhilism.com/post/2020/windows-deterministic-builds... is a good resource on some of the other steps needed. It's... a non-trivial journey :)

kazinator6mo ago

Git hashes have nothing whatsoever to do with whether you can do a clean build of the same tree twice with the same results, bit for bit.

Git hashes or tags can help identify what was built: the inputs.

You only need to know that for traceability: when you hold the released outputs, but do not hold (or are not sure you hold) the matching inputs.

If builds are reproducible, the traceability becomes more meaningful.

In the TXR project, I have a ./configure option called --build-id. This sets an ID that is appended to the version, which is in the executable. It is empty by default and not used. It is meant to be useful for people who interact with the code, so they can check what they are running (things can get confusing when you are going back and forth among versions, or making local changes).

If you set the build ID to the word "git", then it is calculated using:

  git describe --tags --dirty

That's probably what this author should be using. It gives you a meaningful ID that is related to the most recent release tag, and whether the repo was dirty.

  $ git describe --tags --dirty
  txr-302-20-g77c99b74e-dirty

We are (sadly, only) 20 commits after 302, at a commit whose short hash is 77c99b74e, and the repo is in a modified state.

I have it rigged in the Makefile that it actually keeps track of the most recent build ID in a little .build_id file. If the build ID changes relative to what is in that file, the Makefile will force a rebuild of the .o files which incorporate the build ID.
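The Makefile itself isn't shown here, so the following is only a guess at the general mechanism, written as a small C++11 helper rather than make rules: rewrite the tracking file only when the ID actually changed, so its timestamp moves (and dependent objects rebuild) only in that case. `update_build_id_file` is a hypothetical name.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Rewrites `path` only when its content differs from `build_id`.
// Returns true if the file was successfully (re)written, false if it was
// already up to date or the write failed. Leaving an unchanged file alone
// preserves its timestamp, so a timestamp-based tool such as make sees no
// reason to rebuild whatever depends on it.
bool update_build_id_file(const std::string& path,
                          const std::string& build_id) {
    std::ifstream in(path, std::ios::binary);
    if (in) {
        const std::string current((std::istreambuf_iterator<char>(in)),
                                  std::istreambuf_iterator<char>());
        if (current == build_id) return false;  // same ID: nothing to do
    }
    in.close();

    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << build_id;
    out.flush();
    return static_cast<bool>(out);
}
```

In a build, a step like this would run before compilation with the current `git describe` output, and the object files that embed the ID would list the tracking file as a prerequisite.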

Also, there is no need to be generating dynamic #include material just for this. A simple -Dsymbol=var option in the CFLAGS will define a preprocessor symbol:

  CFLAGS += -DMY_BUILD_ID=\"$(my_build_id)\"
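On the C++ side, the symbol defined that way can be consumed directly. A minimal sketch, assuming the `MY_BUILD_ID` macro name from the flag above; `build_id` and `build_is_dirty` are hypothetical helper names, and the `"unknown"` fallback is an assumption so the file still compiles when the flag is missing:

```cpp
#include <cassert>
#include <cstring>

// MY_BUILD_ID is normally supplied on the compiler command line, e.g.
//   -DMY_BUILD_ID=\"txr-302-20-g77c99b74e-dirty\"
// The fallback keeps this file compiling when the flag is absent.
#ifndef MY_BUILD_ID
#define MY_BUILD_ID "unknown"
#endif

// Returns the build ID baked into this translation unit.
const char* build_id() { return MY_BUILD_ID; }

// True if the ID ends in "-dirty", i.e. the tree had uncommitted changes
// when `git describe --dirty` produced it.
bool build_is_dirty() {
    const char* id = build_id();
    const char* suffix = "-dirty";
    const std::size_t n = std::strlen(id);
    const std::size_t m = std::strlen(suffix);
    return n >= m && std::strcmp(id + (n - m), suffix) == 0;
}
```

A `--version` handler or log banner can then print `build_id()`. Only the translation unit that contains these functions needs recompiling when the ID changes.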

shoo6mo ago

Yep, your way of framing it is clearer. Embedding version information in released binary artefacts helps answer the question of "what version of the software even produced this output/is crashing in production?". This is the problem that the author is focusing on, and it is an important thing to sort out early in any serious project, especially if you ship software that gets deployed to customer machines. Setting this up early will probably even pay for itself before the software is in production as knowing what version is deployed where can reduce wasted time due to confusion about which experimental version is deployed to what non prod environment.

It's addressing a distinct problem from "if we rebuild any given version, perhaps some later time, do we even get the same binary?" which is what people usually mean by "reproducible builds".

Your tip that injecting build ids can be done with compiler flags without needing to generate header files is a great one.

Passing version info without code generation using linker flags can also be done in other languages & toolchains, e.g. with Go projects, the go linker exposes an -X flag that can be used to set the value of a string variable in a package [1] [2].

A step beyond this could be to explicitly build a feature into your software to help the user report bugs or request support, e.g. user clicks a button and the software dumps its own version info, info about what the user is doing & their machine, packages it up and sends in to your support queue. Doesn't make sense doing this for backend services, but you do see support features like this in PC games to help users easily send high quality bug reports.

[1] https://pkg.go.dev/cmd/link

[2] https://www.digitalocean.com/community/tutorials/using-ldfla...

ignoramous6mo ago

> Passing version info without code generation using linker flags can also be done in other languages & toolchains, e.g. with Go projects, the go linker exposes an -X flag

Someday, Go programs won't have to do this: https://github.com/golang/go/issues/50603

kazinator6mo ago

In short, "traceable bill of materials" != "reproducible build"

Which golfs to "traceable" != "reproducible"

ziotom786mo ago

As others have commented, this trick alone cannot ensure truly "reproducible" builds.

We used the same trick (git hash + git diff to monitor uncommitted changes) in a Python simulation framework we are developing for the JAXA/EU space mission "LiteBIRD." [1]

[1] https://iopscience.iop.org/article/10.1088/1475-7516/2025/11...

amadio6mo ago

For those of you using CMake, have a look at the module below:

https://github.com/xrootd/xrootd/blob/master/cmake/XRootDVer...

and also the genversion.sh script at the top of the repo.

I use these plus #cmakedefine and git tags to manage the project version without having to do it via commits.

j4cobgarbyOP6mo ago

Here's a short writeup of a bit of my build system for a project I'm working on. It's pretty simple, and is just a relatively clean way of recording the repository state when code was compiled, so I can reproduce results later on. Just thought the interaction between git, cmake, and C++ was a bit nice!

adamchol6mo ago

nix fixes this

had to be said

hogehoge516mo ago

Can I build my embedded firmware with nix using a Windows only toolchain?

(Fyi I just used something like the solution from the article, with the hash embedded in the binary image to be burned to ROM masks. The gaps in toolchain versioning and not building with dirty checkouts can be managed with self discipline /internal checks)

steeleduncan6mo ago

Generally development tools run fine under wine, so I'd guess it would be fine. Running a windows binary within wine within WSL on windows does seem a little insane tho!

hogehoge516mo ago

Now I think of it, WSL can generally call out to Windows tools - you would need to run in a Windows file system mounted into WSL. It just won't port to a Linux-based CI job without Wine. The ideal is a build and test run that is reproducible in CI and locally.

Scott-David6mo ago

Logging Git hashes makes C++ builds reproducible and easy to track.
