Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...
I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.
GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.
#15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
#0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
#0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
#0 8.164 npm ERR! code EINTEGRITY
#0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
#0 8.176
#0 8.177 npm ERR! A complete log of this run can be found in:
#0 8.177 npm ERR! /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
#15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1
This was working earlier today and the docker build/package.json haven't changed.

```
Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
-- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
[DEBUG] To include the environment variables in debug output, pass --debug-env
[DEBUG] Feature flag 'binarycaching' unset
[DEBUG] Feature flag 'manifests' = off
[DEBUG] Feature flag 'compilertracking' unset
[DEBUG] Feature flag 'registries' unset
[DEBUG] Feature flag 'versions' unset
[DEBUG] 5612: popen(curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
[DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
Error: Failed to download from mirror set:
File does not have the expected hash:
          url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
    File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
  Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]
```
The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula
What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...
However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
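A quick shell sketch of the same effect (file names here are illustrative): identical contents, compressed twice with different gzip settings, decompress to the same bytes but hash differently as archives. Even something as small as `-n` (omit the embedded filename and timestamp) changes the output:

```shell
# Identical contents, two gzip invocations: the decompressed data matches,
# but the compressed bytes (and thus the checksums) do not.
printf 'hello supply chain\n' > file.txt
gzip -c  file.txt > a.gz   # gzip header embeds the original name + mtime
gzip -nc file.txt > b.gz   # -n omits the name and timestamp
gzip -dc a.gz > a.out
gzip -dc b.gz > b.out
cmp a.out b.out && echo 'contents identical'
sha256sum a.gz b.gz        # two different hashes for the same content
```

The same applies to swapping gzip for an in-process zlib call, which is exactly what happened here.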
They are probably generated on-demand (and cached) from the Git repository, not prebuilt.
[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.
You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.
Microsoft was once renowned for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.
I know that the Bazel team reached out to GitHub in the past to get a confirmation that this behaviour could be relied on, and only after that was confirmed did they set that as recommendation across their ecosystem.
The hash that pops out of 'git archive' has nothing whatsoever to do with the commit hash and was historically stable more or less by accident: git feeds all files to 'tar' in tree order (which is fixed) and (unless you specify otherwise) always uses gzip with the same options. Since they no longer use gzip but an internal call to zlib, compression output will look different but will still contain the same tar inside.
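A small sketch of that split, assuming `git` and `sha256sum` are available (throwaway repo, fixed dates so the commit is deterministic): the raw tar stream for a given commit is bit-for-bit reproducible, while the `.tar.gz` hash additionally depends on whichever gzip/zlib implementation sits on top:

```shell
# Build a throwaway repo with pinned author/committer dates so the commit
# (and therefore the archive's tree order and mtimes) is deterministic.
tmp=$(mktemp -d); cd "$tmp"
git init -q repo; cd repo
git config user.email demo@example.com
git config user.name  demo
echo hi > f.txt
git add f.txt
GIT_AUTHOR_DATE='2023-01-30T00:00:00 +0000' \
GIT_COMMITTER_DATE='2023-01-30T00:00:00 +0000' git commit -qm init
# The uncompressed tar stream hashes identically on every run...
git archive --format=tar HEAD | sha256sum
git archive --format=tar HEAD | sha256sum
# ...whereas the .tar.gz hash is only as stable as the compressor behind it.
```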
That people have relied on this archive hash being stable is an indication of a major problem imho, because it might mean that people in their heads project integrity guarantees from the commit hash (which has such guarantees) onto the archive hash (which doesn't have those guarantees). I would suggest randomizing the archive hash on purpose by introducing randomness somewhere, so that people no longer rely on it.
The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.
There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.
the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.
For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.
GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.
Oh wait, what hash? :clown:
In the real world it will take millions of dollars of eng labor just to update the hashes to fix everything that's currently broken and millions more to actually implement something better and move everyone over to it.
This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.
I think everyone knows these files are generated on the fly, but it comes from old habits.
Did cache hits save you? Did cache misses break your builds?
looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).
> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
It's crazy how such a seemingly innocuous change could lead to such a widespread loss of productivity across the globe.
The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.
But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...
I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.
https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...
> uses 2% less CPU time. That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
> What are the benefits? Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.
It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. Intentionally making it 17% slower to save 2% of CPU time probably pays off at their scale, but every user who runs the command locally now waits 17% longer.
Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.
[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
https://bugzilla.tianocore.org/show_bug.cgi?id=3099
At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit), I guess they must have stabilized it at some point. Here is an old issue before GitHub had a community issue tracker:
https://github.com/isaacs/github/issues/1483
I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.
[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...
==> Validating source files with b2sums...
labwc-0.6.1.tar.gz ... FAILED
==> ERROR: One or more files did not pass the validity check!

Surely Microsoft-GitHub's own internal builds would have started failing as a result of this change? Or do they not even canary their releases internally at all?
"didn't read every commit in new version of git, realized after the fact"
[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...
[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...
[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0
On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.
It's the downstream tooling (i.e. all the build systems and package managers) that needs to clean up its act.
POTUS issued an EO and NIST has been following up, leading to the promotion of schemes such as SPDX: https://tools.spdx.org/app/about/
Where I work, we're also required to start documenting our supply chain as part of the (new, replacing PA-DSS) PCI SSF certification requirements, which require end-to-end verification of artifacts deployed within PCI scope.
So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.
However, I do think it's a bad idea to enforce the content of compressed archives to be deterministic. tar has never specified an ordering of its contents. Compression algorithms are parameterized for time and space, so their output should not be deterministic either. Both of these principles apply to zip as well. But we now have a situation where we are depending on both the archive format and the compression algorithm to produce a deterministic output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.
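For example (a sketch assuming GNU tar, with illustrative file names): archiving the same two files in a different member order yields byte-different archives even with all other metadata pinned.

```shell
# Same contents, same pinned metadata, different member order:
# the extracted trees would be identical, but the archive bytes are not.
printf 'one\n' > a.txt
printf 'two\n' > b.txt
tar --mtime=@1000000000 --owner=0 --group=0 --numeric-owner -cf order1.tar a.txt b.txt
tar --mtime=@1000000000 --owner=0 --group=0 --numeric-owner -cf order2.tar b.txt a.txt
sha256sum order1.tar order2.tar   # two different hashes for the same content
```

Tools that want reproducible archives have to pin ordering, timestamps, and ownership explicitly; nothing in the tar format itself guarantees any of it.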
Tar/zipball archives on the same ref never have a stable hash.
Forever problem 1:
No sha256/512/3 hashes of said tar/zipballs.
Forever problem 2:
No metalinks for those.
Forever problem 3:
Not IPv6. Some of our network is IPv6 only.
Forever problem 4:
Hitting secondary rate limiting because I can browse fast.
You can try it online here:
and relies on checksumming ephemeral artefacts for integrity.
GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.
Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?
files uploaded to GH Packages are not modified by GitHub.
only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.
if you upload a package to GH Packages or upload a release asset to a GitHub releases those are never modified, and you can rely on those hashes.
GitHub chooses to do this. It's GitHub's choice to generate Source Code files on demand rather than when the release is made. It's a way of reducing their disk usage at the cost of this kind of potential problem.
The problem is they also presented it as if it was a stable reference. If people knew it was not stable they would have done what the Bazel devs are now talking about doing, which is also uploading the source code at release time, as an artifact (which is how it works on Nexus).
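That's the pattern uploaded assets enable: fetch the asset from the release (not the auto-generated "Source code" link) and check it against a hash pinned at release time. A minimal sketch with `sha256sum -c` (the file name is a placeholder, and a local file stands in for the downloaded asset):

```shell
# Stand-in for a downloaded release asset; in practice you'd curl the
# .../releases/download/<tag>/<asset> URL, never the generated archive.
printf 'pretend release asset\n' > rules_demo-1.0.tar.gz
# The expected hash would normally be pinned in your build config at
# release time; here we compute it once to simulate that pin.
pinned=$(sha256sum rules_demo-1.0.tar.gz | cut -d' ' -f1)
echo "$pinned  rules_demo-1.0.tar.gz" | sha256sum -c -
# Any later change to the bytes makes this check fail loudly.
```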
Keep it simple, just vendor your deps.
GitHub has pretty much a one-click (or one-API-call) workflow to create properly versioned and archived tarballs. Just because lots of people try to skirt proper version management doesn't mean you should commit the world into your repo.
How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.
https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...