Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...
I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.
GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.
Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of
sha256sum $(find . -type f | sort) | sha256sum
I am not here advocating that everyone switch to this basic directory hash either, because it's not a solution to the more general problem that many systems are solving, namely validating _any_ downloaded file, not just file archives. There are widespread, standard tools to run a SHA256 over a downloaded file, and those tools work on _any_ downloaded file. Essentially every programming language ships with or has easily accessible libraries to do the same. In contrast, there are not widespread, standard tools or libraries for the "NAR Hash" nor the Go "directory hash". Even if there were, such tools would need to be able to parse every kind of file that people might be downloading as part of a build, not just tar files.
It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.
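For anyone who wants to see what that pipeline boils down to, here is a rough Python sketch. It only approximates the shell one-liner quoted above (it assumes bytewise sorting, as with `LC_ALL=C`, and skips symlinks like `find -type f` does); it is not Go's actual module hash format, and `directory_hash` is just a name picked for illustration.

```python
import hashlib
import os


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_hash(root: str) -> str:
    """Hash a sorted manifest of '<sha256>  ./<path>' lines, one per regular file."""
    relative_paths = []
    for current_dir, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(current_dir, name)
            # Regular files only; symlinks are skipped (assumption, mirrors find -type f).
            if os.path.isfile(full_path) and not os.path.islink(full_path):
                rel = os.path.relpath(full_path, root).replace(os.sep, "/")
                relative_paths.append(rel)

    manifest = hashlib.sha256()
    for rel in sorted(relative_paths):
        line = f"{file_sha256(os.path.join(root, rel))}  ./{rel}\n"
        manifest.update(line.encode("utf-8"))
    return manifest.hexdigest()
```

The result depends only on file paths and contents, not on how the tree was archived or compressed, which is also why it needs the archive unpacked first.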
Not to mention, forcing people to use GitHub releases instead of just tags (which excludes every mirror of somewhere else)
- you use autoconf (or any other tool(s) that require generating code into the source archive); or
- you have submodules (to which `git archive` is completely blind).
Note that `git-archive-all`[1] can help as long as your submodules don't do things like `[attr]custom-attr` in their `.gitattributes` as it is only allowed in the top-level `.gitattributes` file and cannot be added to the tree otherwise.
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
#15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
#0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
#0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
#0 8.164 npm ERR! code EINTEGRITY
#0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
#0 8.176
#0 8.177 npm ERR! A complete log of this run can be found in:
#0 8.177 npm ERR! /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
#15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1
This was working earlier today and the docker build/package.json haven't changed.
```
Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
-- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
[DEBUG] To include the environment variables in debug output, pass --debug-env
[DEBUG] Feature flag 'binarycaching' unset
[DEBUG] Feature flag 'manifests' = off
[DEBUG] Feature flag 'compilertracking' unset
[DEBUG] Feature flag 'registries' unset
[DEBUG] Feature flag 'versions' unset
[DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
[DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
Error: Failed to download from mirror set: File does not have the expected hash:
url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]
```
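Both failures above are the same kind of check: hash the downloaded bytes with SHA-512 and compare against a pinned value. The npm log prints the digest as `sha512-` plus base64, the vcpkg log prints it as hex. A minimal Python sketch of that comparison follows; the function names are made up for illustration and this is not the code either tool actually runs.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Format a SHA-512 digest as 'sha512-<base64>', the shape seen in the npm log."""
    raw_digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(raw_digest).decode("ascii")


def hex_sha512(data: bytes) -> str:
    """Format a SHA-512 digest as lowercase hex, the shape seen in the vcpkg log."""
    return hashlib.sha512(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Compare downloaded bytes against a pinned digest in either notation."""
    if expected.startswith("sha512-"):
        return sri_sha512(data) == expected
    return hex_sha512(data) == expected.lower()
```

Any change to the served bytes, including a recompressed archive with identical contents, makes `verify` return false, which is exactly what the two logs show.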
The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula.
What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...
However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
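This is easy to reproduce locally. The sketch below uses Python's `gzip` module with two compression levels as a stand-in for "a different compressor"; the payload is arbitrary generated text, not a real tarball.

```python
import gzip
import hashlib

# Stand-in for the bytes of a tar stream; any payload works for the demonstration.
payload = b"".join(f"file-{i}: {i * i}\n".encode() for i in range(5000))

# Same payload, two compression levels; mtime=0 keeps the gzip header timestamp fixed.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

# Hash of the compressed archive versus hash of what is inside it.
archive_hashes = {hashlib.sha256(blob).hexdigest() for blob in (fast, best)}
content_hashes = {hashlib.sha256(gzip.decompress(blob)).hexdigest() for blob in (fast, best)}

print(len(archive_hashes), "distinct archive hashes")
print(len(content_hashes), "distinct content hashes")
```

The compressed blobs hash differently while the decompressed bytes hash identically, so a checksum pinned on the `.tar.gz` breaks even though nothing in the source changed.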
They are probably generated on-demand (and cached) from the Git repository, not prebuilt.
Unfortunately, for this kind of service you need to actively fiddle with the bytes to keep people from relying on an implementation detail like this, and to keep them from digging you into a too-big-to-fail API stability hole.
[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.
You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.
This accurately describes my beef with golang
Microsoft was once renowned for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.
I know that the Bazel team reached out to GitHub in the past to get a confirmation that this behaviour could be relied on, and only after that was confirmed did they set that as recommendation across their ecosystem.
The hash that pops out of 'git archive' has nothing whatsoever to do with the commit hash and was historically stable more or less by accident: git feeds all files to 'tar' in tree order (which is fixed) and (unless you specify otherwise) always uses gzip with the same options. Since they no longer use gzip but an internal call to zlib, compression output will look different but will still contain the same tar inside.
That people have relied on this archive hash being stable is an indication of a major problem imho, because it might mean that people in their heads project integrity guarantees from the commit hash (which has such guarantees) onto the archive hash (which doesn't have those guarantees). I would suggest randomizing the archive hash on purpose by introducing randomness somewhere, so that people no longer rely on it.
The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.
There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.
the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.
For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.
GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.
Oh wait, what hash? :clown:
In the real world it will take millions of dollars of eng labor just to update the hashes to fix everything that's currently broken and millions more to actually implement something better and move everyone over to it.
This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.
"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"
I think everyone knows these files are generated on the fly, but it comes from old habits.
Firstly SHA is not a secure hash.
Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
What am I missing?
Did cache hits save you? Did cache misses break your builds?
looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).
> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
It's crazy how such a seemingly innocuous change could lead to such a widespread loss of productivity across the globe.
The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.
But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...
I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.
https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...
> uses 2% less CPU time. That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
> What are the benefits? Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.
It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but does it make sense for every user who runs the command locally to spend 17% more time?
Looks like the author is the maintainer of "Git for Windows", and similar, which I can imagine makes for a reasonable argument for reducing dependencies. zlib is already a library dependency, just use that instead of needing people to bundle up a gzip binary along with git, too.
https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....
Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?
That's without even mentioning the absurdity of saving 2% CPU but still using zlib.
Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.
[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
https://bugzilla.tianocore.org/show_bug.cgi?id=3099
At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit); I guess they must have stabilized it since. Here is an old issue from before GitHub had a community issue tracker:
https://github.com/isaacs/github/issues/1483
I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.
[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...
==> Validating source files with b2sums...
labwc-0.6.1.tar.gz ... FAILED
==> ERROR: One or more files did not pass the validity check!
Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?
"didn't read every commit in new version of git, realized after the fact"
[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...
[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...
[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0
On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.
It's the downstream tooling (i.e. all the builds and package managers) that needs to clean its act up.
POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as spdx https://tools.spdx.org/app/about/
My employer is also required to start documenting our supply chain as part of the new PCI SSF certification requirements (the framework replacing PA-DSS), which require end-to-end verification of artifacts that are deployed within PCI scope.
So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.
However, I do think it's a bad idea to enforce the content of compressed archives to be deterministic. tar has never specified an ordering of its contents. Compression algorithms are parameterized for time and space, so their output should not be deterministic either. Both of these principles apply to zip as well. But we now have a situation where we are depending on both the archive format and the compression algorithm to produce a deterministic output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.
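One way to stop depending on both at once is to hash what the archive contains rather than the archive bytes. Below is a rough Python sketch under stated assumptions: `make_targz` and `content_digest` are hypothetical helpers, only regular files are covered (modes, symlinks, and other metadata are ignored), and the archive has to be parsed before it is verified, which has its own risks.

```python
import gzip
import hashlib
import io
import tarfile
from typing import Dict, List


def make_targz(files: Dict[str, bytes], order: List[str], level: int) -> bytes:
    """Build an in-memory .tar.gz with a chosen member order and gzip level."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return gzip.compress(raw.getvalue(), compresslevel=level, mtime=0)


def content_digest(archive: bytes) -> str:
    """Digest over sorted (name, sha256(content)) pairs of the regular files."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                entries.append((member.name, hashlib.sha256(data).hexdigest()))
    digest = hashlib.sha256()
    for name, file_hash in sorted(entries):
        digest.update(f"{file_hash}  {name}\n".encode("utf-8"))
    return digest.hexdigest()


files = {"README": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n"}
first = make_targz(files, ["README", "src/main.c"], level=1)
second = make_targz(files, ["src/main.c", "README"], level=9)
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
print(content_digest(first) == content_digest(second))
```

The two archives differ byte for byte (different member order, different compression level), yet the content digest is the same for both.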
Tar/zipball archives on the same ref never have a stable hash.
Forever problem 1:
No sha256/512/3 hashes of said tar/zipballs.
Forever problem 2:
No metalinks for those.
Forever problem 3:
Not IPv6. Some of our network is IPv6 only.
Forever problem 4:
Hitting secondary rate limiting because I can browse fast.
You can try it online here:
and relies on checksumming ephemeral artefacts for integrity.
GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.
Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?
files uploaded to GH Packages are not modified by GitHub.
only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.
if you upload a package to GH Packages or upload a release asset to a GitHub release, those are never modified, and you can rely on those hashes.
GitHub chooses to do this. It's GitHub's choice to generate Source Code files on demand rather than when the release is made. It's a way of reducing their disk usage at the cost of this kind of potential problem.
The problem is they also presented it as if it was a stable reference. If people knew it was not stable they would have done what the Bazel devs are now talking about doing, which is also uploading the source code at release time, as an artifact (which is how it works on Nexus).
how? the docs state that the hashes of these files are not guaranteed to be stable.
the decision to generate those files on demand is a good one, provided that the behavior is documented, and it is.
others in this thread figured it out before this particular issue arose and made the necessary changes to their workflows so that their downloads would have stable hashes.
Keep it simple, just vendor your deps.
GitHub has pretty much a one-click (or one API call) workflow to create properly versioned and archived tarballs. Just because lots of people try to skirt proper version management doesn't mean you should commit the world into your repo.
How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.
1. You work in a company, you are in a team, and you want some reasonable code review process in place. Now you want to check in a 3rd party dependency ("let's vendor it!"), so you send out a PR with ... 10,000 - 100,000 lines of code. Your reviewer has no reasonable way to know whether a) the dependency was downloaded from a reputable source, b) the code was modified maliciously, or c) some local patch or local change was added, voluntarily or accidentally (maybe you tried running configure/make locally and didn't realize that one .h file was generated on your machine). A diligent reviewer would have to re-download the source tarball from a reputable source (is the URL in the commit message? A README? better hope it is!), unpack it locally, generate the set of files and all hashes, and compare them with your PR. They would also have to ensure that the PR / vendored dependency comes with a README or METADATA file so the download URL and LICENSE are recorded for posterity.
2. Now you need to update the dependency. Either it goes in a new directory (so you vendor both versions), or you have to delete all files that are gone. The PR review will be worse: it will show a diff, but the reviewer won't really review it and can only repeat the steps in 1. And that is without considering patches applied in the meantime, since the code was simply checked into the repository and anyone could easily change it.
3. For anything but small/tiny projects, the vendoring will take up most of the download / checkout time of your repository.
If you use git for vendoring, the problem is not significantly better: if you care about the integrity of the vendored code, you need to verify the final tree, or the log / hash / set of commits.
Compare to using a simple file with a 1) url, 2) secure hash, 3) list of patches to apply. Reviewing and ensuring correctness is trivial, upgrading is trivial, PRs are trivial.
To avoid problems like the GitHub one here, a simple proxy or local cache is enough: a tool that takes the hash (or a hash of a URL), reads the file from disk, and detects corruption.
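A minimal sketch of what such a pin plus cache could look like, in Python. Everything here is hypothetical (the `PinnedDependency` record, the cache layout keyed by SHA-256, the exception name); real tools differ in the details.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass(frozen=True)
class PinnedDependency:
    """The 'simple file' entry: where to fetch, what to expect, what to patch."""
    url: str
    sha256: str
    patches: Tuple[str, ...] = ()


class CorruptCacheEntry(Exception):
    """Raised when a cached file no longer matches the hash it is stored under."""


def cache_store(cache_dir: Path, dep: PinnedDependency, data: bytes) -> None:
    """Store downloaded bytes under their pinned hash, refusing mismatches."""
    if hashlib.sha256(data).hexdigest() != dep.sha256:
        raise ValueError(f"download for {dep.url} does not match pinned hash")
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / dep.sha256).write_bytes(data)


def cache_lookup(cache_dir: Path, dep: PinnedDependency) -> Optional[bytes]:
    """Return cached bytes, None on a miss, or raise if the entry is corrupt."""
    entry = cache_dir / dep.sha256
    if not entry.exists():
        return None
    data = entry.read_bytes()
    if hashlib.sha256(data).hexdigest() != dep.sha256:
        raise CorruptCacheEntry(f"cache entry {entry} failed verification")
    return data
```

Once an archive is in the cache, the build no longer cares whether the upstream host regenerates it with different bytes; only a cache miss goes back to the network.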
https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...