Git archive checksums may change

(github.blog)

245 points | mcovalt | 3y ago | 240 comments

Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

rsc3y ago

Thanks for the quick rollback.

I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.

GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.

matthewcroughan3y ago

I've just had a thought. When GitHub do update the hashing for better compression, everyone relying on the tar hash will update their hashes. This is the ultimate opportunity to change the tar contents, affect the supply chain, introduce vulnerabilities, and have everyone trust you. Something like Nix which computes the NAR Hash (the result of the tar contents) will not be affected by this, since it only cares about the content. I think this is much better than worrying about an unlikely tar vulnerability. In a system that only trusts the tar hashes, the original source is not able to take advantage of better compression over time, without massive risk of supply chain attack. If you think you can hand me a tarball that can run arbitrary code, for any version of tar that has ever existed, please give it to me so I can experiment with exploits, and I'll buy you a drink of your choice at FOSDEM if you're there!

rsc3y ago

You're not wrong, but you're also not being realistic.

Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of

    sha256sum $(find . -type f | sort) | sha256sum

I am not here advocating that everyone switch to this basic directory hash either, because it's not a solution to the more general problem that many systems are solving, namely validating _any_ downloaded file, not just file archives.

There are widespread, standard tools to run a SHA256 over a downloaded file, and those tools work on _any_ downloaded file. Essentially every programming language ships with or has easily accessible libraries to do the same. In contrast, there are not widespread, standard tools or libraries for the "NAR Hash" nor the Go "directory hash". Even if there were, such tools would need to be able to parse every kind of file that people might be downloading as part of a build, not just tar files.

It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.
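
For readers who want to see the shape of such a content-only hash, here is a minimal Python sketch loosely modeled on the shell pipeline above. It is an illustration only: the function name `directory_hash` and the exact line format are assumptions, and its output will not match Go's directory hash or a Nix NAR hash.

```python
import hashlib
import os
import tempfile


def directory_hash(root: str) -> str:
    """Hash a directory tree by content only.

    One "<sha256>  <relative path>" line per file, sorted by path, then
    SHA-256 over those lines. Illustrative format, not Go's or Nix's.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append((rel, digest))
    summary = "".join(f"{digest}  {rel}\n" for rel, digest in sorted(entries))
    return hashlib.sha256(summary.encode("utf-8")).hexdigest()


def demo() -> bool:
    """Same files written in a different order give the same hash."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for root, names in ((a, ["x.txt", "y.txt"]), (b, ["y.txt", "x.txt"])):
            for name in names:
                with open(os.path.join(root, name), "w") as f:
                    f.write(name)
        return directory_hash(a) == directory_hash(b)
```

Because only file names and file bytes feed the hash, nothing about how the files were archived or compressed can change the result.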

bentley3y ago

I would also appreciate stronger advertising of the ability to turn a Git tag into a GitHub release and upload stable source code files to it. Maybe even a button in the GitHub releases interface to “generate source tarball and attach as stable tarball to this release.”

misnome3y ago

But this isn’t a great solution, because afterwards there are now three or four source download links, only some of which are stable.

Not to mention, forcing people to use GitHub releases instead of just tags (which excludes every mirror of somewhere else)

mathstuf3y ago

I agree this would be great. However, it should also stop you from providing useless tarballs (as `/archive/` does today) if:

- you use autoconf (or any other tool(s) that require generating code into the source archive); or
- you have submodules (to which `git archive` is completely blind).

Note that `git-archive-all`[1] can help as long as your submodules don't do things like `[attr]custom-attr` in their `.gitattributes` as it is only allowed in the top-level `.gitattributes` file and cannot be added to the tree otherwise.

[1]https://github.com/roehling/git-archive-all

matthewcroughan3y ago

https://floxdev.com/blog/hash-collision

vtbassmatt3y ago

We updated our Git version which made this change for the reasons explained. At the time we didn't foresee the impact. We're quickly rolling back the change now, as it's clear we need to look at this more closely to see if we can make the changes in a less disruptive way. Thanks for letting us know.

phphphphp3y ago

Consumers often mistake “hasn’t changed” for a commitment to never change: any sufficiently large product will be littered with these kinds of implicit commitments made by the product to consumers that nobody has visibility into. You’re unfortunate that we were all relying on this commitment you’ve never made, but the quick reversion is the best we can hope for. People will theorise how this could have been avoided but c’est la vie: an easy mistake that you’ve responded well to.

dharmab3y ago

Hyrum's Law:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

nickitolas3y ago

FWIW according to https://github.com/bazel-contrib/SIG-rules-authors/issues/11... a commitment was made, although in an exchange in some support ticket, and not in documentation.

VWWHFSfQ3y ago

At this point they'll be stuck on old git for all of eternity unless they just roll their own archive/compress step out of band so the old hashes still work. Yikes.

mdouglass3y ago

We are seeing an npm install failure inside our docker builds pointing at a github URL with a SHA change. Is this possibly related?

  #15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
  #0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
  #0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
  #0 8.164 npm ERR! code EINTEGRITY
  #0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
  #0 8.176 
  #0 8.177 npm ERR! A complete log of this run can be found in:
  #0 8.177 npm ERR!     /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
  #15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1

This was working earlier today and the docker build/package.json haven't changed.
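
For context on the log above: the `sha512-…` values are integrity strings in the Subresource Integrity style, that is, the algorithm name, a hyphen, and the base64 encoding of the raw digest of the downloaded bytes. A minimal Python sketch of that check follows; the helper names `sri_sha512` and `verify` are hypothetical, and this is not npm's own code.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Return an SRI-style integrity string for the given bytes."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def verify(data: bytes, expected: str) -> bool:
    """True only if the bytes hash to exactly the expected integrity string."""
    return sri_sha512(data) == expected


tarball = b"bytes of the downloaded .tar.gz"
pinned = sri_sha512(tarball)          # what a lockfile would record
ok = verify(tarball, pinned)          # same bytes: passes
broken = verify(tarball + b"\0", pinned)  # any byte change: fails
```

The check covers the compressed file as served, so a re-compressed archive with identical contents still fails it.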

andrewguenther3y ago

Yes, this is the exact issue being described

mdouglass3y ago

That's what I thought, but I assumed with the rollback an hour plus ago, it wouldn't still be happening. That was off a build just a few minutes ago (actually repeated it in between the time I posted my original message and this reply and it happened again).

voidbip3y ago

Just want to second this. Still seeing an issue in our build right now that seems related.

```
Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
-- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
[DEBUG] To include the environment variables in debug output, pass --debug-env
[DEBUG] Feature flag 'binarycaching' unset
[DEBUG] Feature flag 'manifests' = off
[DEBUG] Feature flag 'compilertracking' unset
[DEBUG] Feature flag 'registries' unset
[DEBUG] Feature flag 'versions' unset
[DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
[DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
Error: Failed to download from mirror set:
File does not have the expected hash:
url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]
```

kris-nova3y ago

Thanks for the update! There is only 1 internet to watch and learn from. We are all in this together. <3

denom3y ago

In my particular use-case, I'm using a set of local dev tools hosted as a homebrew tap.

The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula

What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...

lozenge3y ago

They never really stored them, they were always generated by some code (maybe with a cache layer in front). The code changed in a way that changed the bytes in the tar.gz without affecting their contents-when-extracted.

crote3y ago

The trick here is that a Github release is in essence simply a tag of a specific commit. There is no need to build archives in advance, as they can be dynamically generated from the git repo.

However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
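
A small Python sketch of that effect, using only the standard library. The compression level is used here purely as a stand-in for "a different compressor implementation", and the payload is made up for the demonstration.

```python
import gzip
import hashlib

# Made-up payload standing in for an uncompressed tar stream.
payload = "".join(
    f"file{i}.c: int f{i}(void) {{ return {i * i}; }}\n" for i in range(2000)
).encode("utf-8")

# Two valid gzip encodings of exactly the same bytes.
# mtime=0 keeps the current time out of the gzip header.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

# Decompressing either one gives back the identical payload...
same_content = gzip.decompress(fast) == gzip.decompress(best) == payload

# ...but the compressed files, and therefore their checksums, differ.
same_archive = hashlib.sha256(fast).digest() == hashlib.sha256(best).digest()
```

Here `same_content` comes out true while `same_archive` comes out false, which is the situation every pinned tarball checksum ran into.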

Denvercoder93y ago

> Does this mean the entire set of zip/tar.gz archives has been rebuilt?

They are probably generated on-demand (and cached) from the Git repository, not prebuilt.

scyrybdis3y ago

I think the zip/tar.gz archives are being created on the fly when you download them, probably with a caching layer in front.

tinus_hn3y ago

Pretty bizarre this ever was stable in the first place.

Unfortunately for this kind of service you need to actively fiddle with the bytes to prevent people from relying on an implementation detail like this and prevent them from digging you into a too big to fail api stability hole.

vlovich1233y ago

Hyrum's Law strikes again. It kind of doesn't matter what you document. If you weren't randomizing your checksum previously [1], you can't just spring this on the community and blame it for the fallout. I'm more shocked that there's resistance from the GitHub team saying "but we documented this isn't stable". Default stance for the team should be rollback & reevaluate an alternate path forward when the scope is this wide (e.g. only generating the new tarballs for future commits going forward).

[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.

blueflow3y ago

But look at it from the other side. Users that don't read your documentation and expect your software to work like they imagined are just a huge pain in the ass.

vlovich1233y ago

Fact of life: the vast majority of your users do not read your documentation (or do not do so carefully enough that what you put in your docs is an ironclad proof that all users adhere to). That's literally what Hyrum's law is about. Of course, you can choose to do whatever you want. It's valuable to recognize of course that you're trading off good will from your users with whatever technical improvement is getting made. Sometimes it's appropriate and inevitable (e.g. old behavior is just wrong or harmful and better to cut off). In the vast majority of cases though it's better to just have a better process in place to manage this with minimal disruption, identifying and communicating with broken users, and only then making that change.

blueflow3y ago

That's support you could expect if you paid for it.

ZephyrBlu3y ago

You just described >90% of users. Everyone does this for something, most people do it for most things.

You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

grepfru_it3y ago

>Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

This accurately describes my beef with golang

missingdays3y ago

Yes, but if you implement the checksum algorithm for GitHub archives, shouldn't you read the documentation about archives checksum?

dataflow3y ago

I don't think expecting users to go look for a user manual on each website whose links they download from is a realistic expectation.

blueflow3y ago

Worse, you can't expect other people to host your data for free, forever. If you want your data distributed, you need to check first if the platform is suitable for your purposes.

lupire3y ago

If you don't want users, feel free to ignore them.

throwawaylinux3y ago

If your product supports some particular behavior, it will be used regardless of what you document.

Microsoft was once renowned for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.

mr_toad3y ago

Give a man a fish and he’ll assume he’s entitled to a lifetime supply of free fish.

dataflow3y ago

This has nothing to do with free vs. paid? The question is whether giving someone 99 of the same fish entitles them to expect the 100th one you throw in to be the same kind of fish, whether they paid for it or not.

kkirsche3y ago

This. You have to draw the line somewhere. Was this specific choice that line? Maybe not, but sometimes users aren’t right and changes just need to occur to ensure other asks from the same users can be delivered.

ilyt3y ago

I'd imagine they broke their own stuff doing it, considering npm broke on it

KyeRussell3y ago

Do you work for Google?

hobofan3y ago

This isn't even a case of "we didn't document this".

I know that the Bazel team reached out to GitHub in the past to get a confirmation that this behaviour could be relied on, and only after that was confirmed did they set that as recommendation across their ecosystem.

nilsbunger3y ago

This is especially true of something like a git SHA, which is drilled into your head as THE stable hash of your code and git tree at a certain state. It should be expected that lots of tools use it as an identifier -- heck, I've done so myself to confirm which version of a piece of software is deployed on a particular machine, etc.

Denvercoder93y ago

The Git commit hashes didn't change (that'd actually be a serious problem). The hash of a compressed archive of the contents of a Git commit changed.

c4mpute3y ago

Yes, but not in this bug. I guess lots of people missed that distinction: the stable git SHA hash is the commit hash, which is a hash over git's internal representation of the commit object (containing a tree of all file hashes, and parents' hashes).

The hash that pops out of 'git archive' has nothing whatsoever to do with the commit hash and was historically stable more or less by accident: git feeds all files to 'tar' in tree order (which is fixed) and (unless you specify otherwise) always uses gzip with the same options. Since they no longer use gzip but an internal call to zlib, compression output will look different but will still contain the same tar inside.

That people have relied on this archive hash being stable is an indication of a major problem imho, because it might mean that people in their heads project integrity guarantees from the commit hash (which has such guarantees) onto the archive hash (which doesn't have those guarantees). I would suggest randomizing the archive hash on purpose by introducing randomness somewhere, so that people no longer rely on it.
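
To make the distinction concrete, here is a minimal Python sketch of how git derives an object ID for a file (a "blob") in the default SHA-1 object format: it hashes a short header plus the uncompressed content. The function name `git_blob_id` is made up for this example; commits and trees are hashed the same way over their own object bodies.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob in the SHA-1 object format.

    The hash covers "blob <size>\\0" followed by the raw content. The zlib
    compression git applies when storing the object is not part of the hash.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
hello_id = git_blob_id(b"hello\n")
```

Since compression never enters the calculation, changing the compressor cannot change a commit hash, whereas the checksum of a `.tar.gz` covers the compressed bytes directly.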

thirtyseven3y ago

The people that this broke weren't directly depending on the output of git archive being stable, but were assuming that the response data for a particular URL would stay constant. Maybe not a great idea either but not entirely unreasonable IMO.

nilsbunger3y ago

Oh interesting. But if an archive hash isn’t stable, how is it meant to be used? What’s it good for?

vlovich1233y ago

To be fair this isn't the git SHA. This is the generated archive (apparently dynamically per request) when you ask for a source tarball.

daniealapt3y ago

https://xkcd.com/1172/

sneak3y ago

It's Microsoft. Just as the Apple of today is not the Apple of ten years ago, the GitHub today is not the GitHub of ten years ago. It's literally different people.

The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.

There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.

naikrovek3y ago

The Microsoft of today isn't the Microsoft of 10 years ago, either, but that doesn't stop anyone from assuming that today's Microsoft is the same as the Microsoft of 10 years ago.

the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.

katbyte3y ago

Trust is lost quickly and easily and earned back slowly with great difficulty

lucb1e3y ago

I didn't even know I should be depending on compression, file ordering, created-at file metadata, etc. being stable when pressing 'download repository as zip' (if I understand correctly what this is about, since the article doesn't really say). Perhaps it could be stable due to caching for a while after you first press it, but when it gets re-generated? I'm very surprised this was reproducible to begin with, given how much trouble other projects have with that.

For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.

slaymaker19073y ago

I once had a small issue with a deployment at work because of ordering issues within a zip file. That order is important with Spring since that determines which classes are initialized first.

groestl3y ago

One of the first things I check with every jvm packaging/deployment tool I investigate: does it preserve classpath ordering. Some offenders think -jar * is enough.

rfoo3y ago

> gpg seems to be what all of them use

GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.

Oh wait, what hash? :clown:

leoh3y ago

Many tools set mtime to zero to avoid checksum drift
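
A minimal Python sketch of that approach, assuming an in-memory mapping of file names to contents (the function name `deterministic_targz` is hypothetical). It fixes the member order, zeroes timestamps and ownership, and keeps the current time out of the gzip header.

```python
import gzip
import io
import tarfile
from typing import Mapping


def deterministic_targz(files: Mapping[str, bytes]) -> bytes:
    """Build a .tar.gz whose bytes depend only on file names and contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):       # fixed member order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0               # no timestamps
            info.mode = 0o644            # fixed permissions
            info.uid = info.gid = 0      # no local user or group ids
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    # mtime=0 keeps the current time out of the gzip header as well.
    return gzip.compress(buf.getvalue(), compresslevel=9, mtime=0)


first = deterministic_targz({"b.txt": b"2", "a.txt": b"1"})
second = deterministic_targz({"a.txt": b"1", "b.txt": b"2"})
identical = first == second
```

This only pins down the inputs. The output is reproducible for a given Python and zlib build, and a different compressor can still emit different bytes, which is what happened here.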

philipwhiuk3y ago

There are lots of methods to solve this problem - I imagine this was just easiest at the time given it appeared to work. Bazel devs on the list are discussing the best approach going forward - a simple change is to upload a fixed copy as a release artifact.

frankjr3y ago

GitHub will need to revert this change. They've just crippled pretty much every "from source" package manager out there.

metrognome3y ago

Per the post, this was a change to git itself: https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d47...

forgotpwd163y ago

What was the thought behind this change?

georgyo3y ago

If you read the commit message you would see that it is to drop a third-party dependency.

fweimer3y ago

They could just produce tar output and compress that using system gzip. The “git archive” tool supports many output formats.

acdha3y ago

If those tools incorrectly assume an API contract which doesn't exist, isn't the right answer to fix those tools?

kentonv3y ago

In theory, sure, that's what we'd do in an ideal world.

In the real world it will take millions of dollars of eng labor just to update the hashes to fix everything that's currently broken and millions more to actually implement something better and move everyone over to it.

This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.

groestl3y ago

"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"

kzrdude3y ago

I know it's superficial, but I think the problem would have been reduced if they used a download URL that looked like github.com/archive.php?project=rust&version=deadbeef. It's just something that sends a signal and sets a different expectation for the same artifact.

kzrdude3y ago

Well, Github presents a file that looks like it comes from a file server, an old "ftp" archive or so. So they model it on that. Already published versions and tar balls should not change in those systems.

I think everyone knows these files are generated on the fly, but it comes from old habits.

nick__m3y ago

I'd prefer that tools be adapted to be more resilient and not depend on GitHub's particular implementation.

swarfield3y ago

Using SHA hashes when building guarantees that the code that you are building is what you think it is. How else would you verify dependencies like this? GPG signatures would have the same issue if you change the underlying bits.

Denvercoder93y ago

I wouldn't check the hash of the compressed archive, but of the actual files themselves. It's a bit more metadata, but it's also a lot more robust, and allows you to detect changes after unpacking as well.

shakow3y ago

By checking the hash of the extracted files. The hash of the archive is dependent on the order in which the files were compressed, the compression, some metadata, etc.
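
A minimal Python sketch of hashing the files inside an archive instead of the archive itself. The names `content_digest` and `make_targz` and the digest layout are assumptions for illustration; the sample files are made up, and compression level again stands in for a changed compressor.

```python
import gzip
import hashlib
import io
import tarfile


def content_digest(targz: bytes) -> str:
    """SHA-256 over (name, size, bytes) of each regular file, sorted by name."""
    h = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(targz), mode="r:gz") as tar:
        members = sorted(
            (m for m in tar.getmembers() if m.isfile()), key=lambda m: m.name
        )
        for m in members:
            data = tar.extractfile(m).read()
            h.update(f"{m.name}\0{len(data)}\0".encode("utf-8"))
            h.update(data)
    return h.hexdigest()


def make_targz(files: dict, level: int) -> bytes:
    """Pack files into a tar stream and gzip it at the given level."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return gzip.compress(buf.getvalue(), compresslevel=level, mtime=0)


files = {
    f"src/f{i}.c": f"int f{i}(void) {{ return {i * i}; }}\n".encode("utf-8")
    for i in range(200)
}
old = make_targz(files, level=1)
new = make_targz(files, level=9)

archive_hash_changed = hashlib.sha256(old).digest() != hashlib.sha256(new).digest()
content_hash_stable = content_digest(old) == content_digest(new)
```

Note the trade-off: this has to decompress and parse the download before anything has been verified, whereas a plain file checksum can be checked first.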

ArchOversight3y ago

a git checkout of the code at that particular tag hasn't changed. Just the tarball that git archive generates has.

ErikCorry3y ago

This seems like a weak argument.

Firstly SHA is not a secure hash.

Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.

What am I missing?

Zababa3y ago

They're all waiting for your pull requests.

naikrovek3y ago

the change was to git, not GitHub.

nick__m3y ago

Sorry, I misread the GitHub announcement and incorrectly interpreted it.

pxc3y ago

Nixpkgs' so-called binary cache actually also caches source tarballs. Any Nix users out there who ran updates during the change?

Did cache hits save you? Did cache misses break your builds?

anderskaseorg3y ago

Nixpkgs’s fetchFromGitHub function hashes the contents of GitHub archives after unpacking, so it’s unaffected.

pxc3y ago

I should have remembered this! Nixpkgs committers are consistently mindful of things like this in code reviews.

clhodapp3y ago

I could be wrong, but I believe that Nix should be safe for the most part because it does a recursive hash of the stuff it cares about on the extraction of these archives.

jkachmar3y ago

didn’t realize this had happened until i logged off of my work computer & saw someone had shared this thread in a group chat.

looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).

WayToDoor3y ago

https://github.com/orgs/community/discussions/45830#discussi...

> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

skobovm3y ago

I wonder what monetary loss in productivity was due to this change. We noticed this issue a bit before noon, tracked it down to GH, sent out company-wide comms notifying others of the problem, filed tickets with GH, had to modify numerous repos across multiple teams, and now it's 3pm and I'm here reading about it.

It's crazy how such a seemingly innocuous change, like this, could lead to such widespread loss in productivity across the globe.

misnome3y ago

Our conda-forge package builds broke. We had someone declare to us that tag downloads were never stable, just releases. This seems to be the opposite of the known truth about the previous status quo - but does go some way to demonstrating how little the state of the actual guarantees for this system were understood.

wildfire3y ago

See https://github.com/orgs/community/discussions/45830 for the fallout.

kelnos3y ago

The thing I don't get is how this ever worked.

The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.

But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...

FeepingCreature3y ago

gzip is 28 years old. I don't think the output changes anymore.

account423y ago

There is no reason to believe that it won't. Even after 28 years, there could be improvements merged for the compressor. Or perhaps especially after 28 years - we have a lot more memory now but it is slower when compared to our CPUs than it used to be so there is most likely room for tuning. Similar for patches that make use of newer CPU instructions - why would you expect them to take care to produce the exact same output rather than just the best compression possible for a perf budget.

ihattendorf3y ago

That's the whole point, it wasn't an enforced contract but just happened to not change in a long time so it was assumed to be part of the contract. The majority of users don't know how exactly GitHub is serving these archives, they just assume (incorrectly, but reasonably) if they download from this URL they'll always get the same archive bit for bit. That assumption has grown stronger and stronger over time the longer they remained the same, until today.

jzelinskie3y ago

Does anyone have the motivation for why the git project wants to use their own implementation of gzip? Did this implementation already exist and was being used for something else?

I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.

nemetroid3y ago

They're still using zlib to do the heavy lifting. It's not a large patch.

https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...

capableweb3y ago

> So the internal implementation takes 17% longer on the Linux repo, but

> uses 2% less CPU time. That's because the external gzip can run in

> parallel on its own processor, while the internal one works sequentially

> and avoids the inter-process communication overhead.

> What are the benefits? Only an internal sequential implementation can

> offer this eco mode, and it allows avoiding the gzip(1) requirement.

It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but should every user who runs the command locally lose 17% more time?

Twirrim3y ago

This was a change in the upstream git project, I don't think it came from GitHub necessarily?

Looks like the author is the maintainer of "Git for Windows", and similar, which I can imagine makes for a reasonable argument for reducing dependencies. zlib is already a library dependency, just use that instead of needing people to bundle up a gzip binary along with git, too.

https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....

pixl973y ago

Because they pay for the 2% CPU time, not for the 17% local time. In theory the user also pays for 2% less CPU time, but they are much less likely to be CPU limited in their build processes.

Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?

jeffbee3y ago

It seems like if they really wanted to save CPU they'd be caching the outputs. I fail to see why they would be recompressing years-old release tags. This seems like optimization at the wrong level.

That's without even mentioning the absurdity of saving 2% CPU but still using zlib.

semiquaver3y ago

“Their own” implementation is just zlib, already in use throughout git since the dawn of the project for other purposes like blob storage [1].

Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.

[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

groestl3y ago

I think "Drop the dependency on gzip" for something like Git trumps a bit more exposure (which can be mitigated with thorough reviews).

Aissen3y ago

It was publicly known that Github was breaking automatic git archives consistency for many years. Here is a bug on a project to stop relying on fake github archives (as opposed to stable git-archive(1)):

https://bugzilla.tianocore.org/show_bug.cgi?id=3099

At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit), I guess they must have stabilized it at some point. Here is an old issue before GitHub had a community issue tracker:

https://github.com/isaacs/github/issues/1483

I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.

doubleunplussed3y ago

Ah, this will presumably break some Arch Linux AUR packages. Preparing for bug reports.

elesiuta3y ago

I always anticipated something like this could happen and it bothered me enough to create my own workflow [1] to archive, hash, and attach it to each release automatically for my AUR package. I can see how most people wouldn't notice/bother with such a small detail though, so I am not at all surprised by the fallout this caused.

[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...

frankjr3y ago

Yep, it has already broken labwc for me.

    ==> Validating source files with b2sums...
        labwc-0.6.1.tar.gz ... FAILED
    ==> ERROR: One or more files did not pass the validity check!

lopkeny12ko3y ago

I can't fathom how no one internally at Microsoft-Github realized how widespread the breakage would be before rolling this out to all public users.

Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?

ilyt3y ago

I can

"didn't read every commit in new version of git, realized after the fact"

medellin3y ago

I'm thinking of all the Bazel build rules that are about to break from my last company. Someone will have a fun day updating hundreds of hashes.

ErikCorry3y ago

Do they let Github generate the archives as one of the build rules instead of performing the archival and compression locally and uploading the result?

medellin3y ago

Correct. Silly stuff like this happens when you don’t have systems in place that make it easy to store your own artifacts. Additionally a lot of people just want to get things done as quick as possible even if you have the tools in place.

jart3y ago

If they're using multiple URLs like a good Bazel user then they shouldn't be impacted.
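A rough sketch of why a second URL helps, assuming the mirror still serves the bytes that were originally pinned. This is not Bazel's implementation; `fetch_verified`, `fetch_url`, and the example hosts are hypothetical names used only for illustration. Each URL is tried in order and a download is accepted only if its SHA-256 matches the pinned value.

```python
import hashlib
from typing import Callable, Iterable
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Default fetcher: download the body of `url`."""
    with urlopen(url) as response:
        return response.read()


def fetch_verified(
    urls: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes] = fetch_url,
) -> bytes:
    """Return the first download whose SHA-256 matches the pinned checksum."""
    failures = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            failures.append(f"{url}: {exc}")
            continue
        actual = hashlib.sha256(data).hexdigest()
        if actual == expected_sha256:
            return data
        # A mismatch is never accepted; we only move on to the next source.
        failures.append(f"{url}: checksum mismatch (got {actual})")
    raise ValueError("no source matched the pinned checksum:\n" + "\n".join(failures))
```

With a mirror listed alongside the GitHub URL, a regenerated archive on one host turns into a logged mismatch rather than a failed build, and the pinned hash itself never has to be loosened.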

thirtyseven3y ago

The setup instructions for almost [1] every [2] major [3] rule set [4] only provide one (GitHub) url in the Starlark blob you're supposed to copy and paste, so hard to blame users here.

[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...

[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...

[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0

[4] https://github.com/bazelbuild/rules_scala

jart3y ago

I agree. The Bazel developers failed in their leadership.

1 more reply

medellin3y ago

They did where applicable, but I know that not all of them had multiple URLs.

jart3y ago

Well now they know why it's so important. https://github.com/bazelbuild/bazel/commit/ed7ced0018dc5c5eb...

UncleOxidant3y ago

Lol... I was being burned by this just about an hour ago. Cloned a repo, did a build of the project (which uses Bazel to fetch dependencies) and it reported errors due to a mismatch in expected checksums.

hamandcheese3y ago

The fact that this is causing problems seems like a flaw in Bazel, imo. Nix, for example, calculates a hash of the contents of a tarball, rather than a hash of the tarball itself.
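To make the distinction concrete, here is a minimal sketch that hashes what is inside a tarball instead of the tarball's bytes. It is not Nix's NAR format; `make_tarball` and `content_hash` are hypothetical helpers, and only regular files are considered. Note that it has to parse the archive before anything has been verified.

```python
import gzip
import hashlib
import io
import tarfile


def make_tarball(files, compresslevel):
    """Build a .tar.gz in memory from (name, bytes) pairs, in the order given."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return gzip.compress(raw.getvalue(), compresslevel=compresslevel)


def content_hash(tarball):
    """Hash member names and contents in sorted order, ignoring the container."""
    digest = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            # Length-prefix each entry so file boundaries cannot be confused.
            digest.update(f"{member.name}\0{len(data)}\0".encode())
            digest.update(data)
    return digest.hexdigest()


files = [("pkg/a.txt", b"alpha\n"), ("pkg/b.txt", b"beta\n")]
first = make_tarball(files, compresslevel=9)
second = make_tarball(list(reversed(files)), compresslevel=1)

# The archives differ byte for byte, but their contents hash identically.
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
print(content_hash(first) == content_hash(second))
```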

rfoo3y ago

Yep, Nix not affected at all is pretty impressive.

On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.

Foxboron3y ago

They don't really do any source authentication at all. There is no strategy for checking gpg/minisign/whatever signatures and fetching keys to validate these things.

ArchOversight3y ago

I remember a similar breakage happening before due to internal git changes, and thought it was common knowledge to upload your own signed tarballs for releases.

rektide3y ago

Now please give us compression options beyond gzip? :) Some zstd & lz4 please?

metrognome3y ago

I wonder if this incident will encourage our industry to build more robust forms of artifact integrity verification, or if we will instead codify the status quo of "we guarantee repos to be archived deterministically." To me, the latter seems like a more troubling precedent.

bentley3y ago

We’ve regressed from the previous norm of open source projects providing stable source tarballs with fixed checksums, sometimes even with cryptographic signatures.

reindeerer3y ago

That norm still exists, and it's offered by Github in form of Github Releases feature as well.

It's the downstream tooling ( i.e. all the builds and package managers ) that need to clean their act up.

JonChesterfield3y ago

If the source tar changes, how do you propose the downstream tooling distinguishes between data corruption, MITM attack and upstream deciding to change the number without notifying anyone?

1 more reply

rswail3y ago

This is being driven in industry by the push by US FedGov (via NIST) to have supply chain verification after the recent hacks.

POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as SPDX https://tools.spdx.org/app/about/

Where I work is also required to start documenting our supply chain as part of the (new, replacing PA-DSS) PCI SSF certification requirements, which require end-to-end verification of artifacts that are deployed within PCI scope.

So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.

metrognome3y ago

Oh, I'm not arguing that using checksums, SHA for example, for integrity verification is a bad idea. That's what they're designed for, after all.

However, I do think it's a bad idea to enforce the content of compressed archives to be deterministic. tar has never specified an ordering of its contents. Compression algorithms are parameterized for time and space, so their output should not be deterministic either. Both of these principles apply to zip as well. But we now have a situation where we are depending on both the archive format and the compression algorithm to produce a deterministic output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.
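A small illustration of the compression half of that point, using only Python's standard `gzip` module. The payload is made up. The same input compressed at two levels gives different bytes, and therefore different archive checksums, even though both decompress to identical data.

```python
import gzip
import hashlib

# Made-up payload standing in for the tar stream of a source tree.
payload = b"".join(b"line %d of a source file\n" % i for i in range(2000))

# mtime=0 pins the timestamp in the gzip header, so only the level varies.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

print("level 1:", len(fast), "bytes", hashlib.sha256(fast).hexdigest()[:16])
print("level 9:", len(best), "bytes", hashlib.sha256(best).hexdigest()[:16])
print("same data after decompression:", gzip.decompress(fast) == gzip.decompress(best))
```

Any checksum taken over the compressed file therefore pins the compressor's implementation and settings as well as the source code.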

swarfield3y ago

https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

1letterunixname3y ago

Forever problem 0:

Tar/zipball archives on the same ref never have a stable hash.

Forever problem 1:

No sha256/512/3 hashes of said tar/zipballs.

Forever problem 2:

No metalinks for those.

Forever problem 3:

Not IPv6. Some of our network is IPv6 only.

Forever problem 4:

Hitting secondary rate limiting because I can browse fast.

fomine33y ago

I wasn't aware that git archive is reproducible

pabs33y ago

I note that diffoscope is useful for verifying which parts of git/other archives have changed:

https://diffoscope.org/

You can try it online here:

https://try.diffoscope.org/

swarfield3y ago

They have broken almost every open source project that builds external deps. Also broke homebrew apparently.

capableweb3y ago

Good test that the tooling actually works when the checksums are incorrect :) If your "build from source" tool/workflow DIDN'T break, I'd be worried.

groestl3y ago

> every open source project that builds external deps

and relies on checksumming ephemeral artefacts for integrity.

catiopatio3y ago

Source archives have never, in the entire history of open source, been considered ephemeral.

GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.

Denvercoder93y ago

You could also say that some maintainers made that decision for their convenience of not having to build and upload source archives. It is possible to upload your own artifacts to a release on GitHub, and lots of projects do. Those are correctly treated as immutable by GitHub.

2 more replies

mardifoufs3y ago

I think this change only affects automatically (and dynamically) generated source archives, not those that are actually pushed to Github Releases beforehand.

kzrdude3y ago

All these projects relying on github, they are using a free service they don't control. It could go away someday. That will be a bigger crisis than this was..

pxc3y ago

Such tools should definitely checksum package sources lol

robomc3y ago

Think this also broke github codespaces (the downloading of devcontainer "features").

jakeogh3y ago

Github support, please checkout: https://news.ycombinator.com/item?id=34606345

philipwhiuk3y ago

Yet another reason why GitHub is not a good Artifactory/Nexus replacement.

Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?

naikrovek3y ago

this is a git behavior, not a GitHub behavior.

files uploaded to GH Packages are not modified by GitHub.

only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.

if you upload a package to GH Packages or upload a release asset to a GitHub releases those are never modified, and you can rely on those hashes.

philipwhiuk3y ago

No, it's not.

GitHub chooses to do this. It's GitHub's choice to generate Source Code files on demand rather than when the release is made. It's a way of reducing their disk usage at the cost of this kind of potential problem.

The problem is they also presented it as if it was a stable reference. If people knew it was not stable they would have done what the Bazel devs are now talking about doing, which is also uploading the source code at release time, as an artifact (which is how it works on Nexus).

naikrovek3y ago

> The problem is they also presented it as if it was a stable reference.

how? the docs state that the hashes of these files are not guaranteed to be stable.

the decision to generate those files on demand is a good one, provided that the behavior is documented, and it is.

others in this thread figured it out before this particular issue arose and made the necessary changes to their workflows so that their downloads would have stable hashes.

blcknight3y ago

Oh god I spent like an hour debugging why gpg wouldn’t recognize the signature of RVM (Ruby version manager)

forgotpwd163y ago

Can anyone explain what happened? Thing changed, things broke, and things changed back in less than an hour.

zoobab3y ago

Github devs cannot point to their git commit, because Github is not open source.

yakubin3y ago

Now I’m having a laugh at all those times someone tried to explain to me that vendoring dependencies doesn’t make sense, when you have package managers which verify checksums of the things downloaded from GitHub/wherever. A good laugh.

Keep it simple, just vendor your deps.

reindeerer3y ago

This is a false choice. "Vendoring" is much more of a mess than this is, and second, there's no reason to rely on these on the fly tarballs for anything, when proper versioned software releases exist.

Github has pretty much a one-click ( or one API call ) workflow to create properly versioned and archived tarballs. Just because lots of people try to skirt proper version management doesn't mean you should commit the world into your repo

DoctorNick3y ago

With what? The abomination that is `git submodules`?

yakubin3y ago

No. Just copy files into the repo. Any way you like. In a GUI or in a terminal; it doesn't require a dedicated tool. Although Cargo in Rust, for example, provides a subcommand for it (cargo vendor). Alternatively you can host the tarballs somewhere you control in static storage, be it a static web server, object storage or whatever.

How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.

skobovm3y ago

Woof. At the rate packages get updated these days, and the amount of dependencies between them, that just isn't sustainable for any reasonably-sized project in server and -- especially -- frontend land.

1 more reply

rabexc233y ago

Vendoring, even with a tool, has always worked poorly for me. Here are a few reasons:

1. You work in a company, you are on a team, and you want some reasonable code review process in place. Now you want to check in a third-party dependency, "let's vendor it!", so you send out a PR with ... 10,000 - 100,000 lines of code. Your reviewer has no reasonable way to know whether a) the dependency was downloaded from a reputable source, b) the code was modified maliciously, or c) some local patch or change was added, voluntarily or accidentally (maybe you tried running configure/make locally and didn't realize that one .h file was generated on your machine). A diligent reviewer would have to re-download the source tarball from a reputable source (is the URL in the commit message? A README? Better hope there is one!), unpack it locally, generate the set of files and all hashes, and compare with your PR. They would also have to ensure that the PR / vendored dependency comes with a README or METADATA file so the download URL and LICENSE are recorded for posterity.

2. Now you need to update the dependency. Either it goes in a new directory (so you vendor both versions), or you have to delete all the files that are gone. The PR review will be worse: it will show a diff, but the reviewer won't really review it beyond repeating the steps in 1. And that is without considering patches applied in the meantime, since the code was simply checked into the repository and anyone could easily have changed it.

3. For anything but small/tiny projects, the vendoring will take up most of the download / checkout time of your repository.

If you use git for vendoring, the problem is not significantly better: if you care about the integrity of the vendored code, you need to verify the final tree, or the log / hash / set of commits.

Compare to using a simple file with a 1) url, 2) secure hash, 3) list of patches to apply. Reviewing and ensuring correctness is trivial, upgrading is trivial, PRs are trivial.

To avoid problems like the GitHub problem here, a simple proxy or local cache is enough: a tool that takes the hash (or a hash of a URL), reads the artifact from disk, and detects corruption.
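As a minimal sketch of such a cache, assuming artifacts are stored on local disk under their SHA-256 digest: `HashCache` and `CorruptEntry` are hypothetical names, not an existing tool. Every read re-hashes the stored bytes, so a damaged or tampered entry is reported instead of being handed to the build.

```python
import hashlib
from pathlib import Path
from typing import Optional


class CorruptEntry(Exception):
    """Raised when a cached file no longer matches the digest it is stored under."""


class HashCache:
    """Content-addressed store: each artifact lives at <root>/<sha256 hex>."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store `data` and return the digest to pin next to the URL."""
        digest = hashlib.sha256(data).hexdigest()
        (self.root / digest).write_bytes(data)
        return digest

    def get(self, digest: str) -> Optional[bytes]:
        """Return the cached bytes, or None on a miss; verify on every read."""
        path = self.root / digest
        if not path.exists():
            return None
        data = path.read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            raise CorruptEntry(f"cache entry {digest} failed verification")
        return data
```

Once an artifact has been fetched and stored, later builds read the same bytes from disk, so an upstream archive being regenerated only matters for dependencies that were never cached.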

1 more reply

SuperSandro20003y ago

That's why Nix unpacks the archives first and then hashes them.

gray_-_wolf3y ago

Did people not know this? Honest question. I had already run into this a few times before this change, so I assumed it was widespread knowledge and mirrored everything.

skobovm3y ago

How would anyone (outside of GH) have known this? The checksums have been stable for years, and this issue resulted from an internal update to the version of Git being used. It also was not publicized, until this ex post facto blog post

anecdotal13y ago

They have not been stable

https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...

mhitza3y ago

https://xkcd.com/1053/

daniealapt3y ago

Any change breaks a workflow - https://xkcd.com/1172/

capableweb3y ago

True, small percentage will always be impacted by even the tiniest of change. But this was not that, checksums all over the place started breaking, as lots of FOSS is hosted on GitHub and lots of infrastructure depends on checksums remaining the same, otherwise they error out (correctly).


vtbassmatt3y ago

Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

rsc3y ago

Thanks for the quick rollback.

matthewcroughan3y ago

rsc3y ago

You're not wrong, but you're also not being realistic.

Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of

    sha256sum $(find . -type f | sort) | sha256sum

It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.
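For readers who want to see what that pipeline computes, here is a rough Python equivalent. It mirrors the shell one-liner above, not the exact format the Go toolchain records, and it assumes byte-wise sort order (as with LC_ALL=C) and ordinary file names without backslashes or newlines. `directory_hash` is a hypothetical name.

```python
import hashlib
import os
from pathlib import Path


def directory_hash(root):
    """Hash the sorted list of per-file SHA-256 lines, like the pipeline above."""
    root = Path(root)
    names = []
    for dirpath, _dirs, filenames in os.walk(root):
        for filename in filenames:
            full = Path(dirpath) / filename
            # `find -type f` matches regular files only, not symlinks.
            if full.is_file() and not full.is_symlink():
                names.append("./" + full.relative_to(root).as_posix())
    summary = hashlib.sha256()
    for name in sorted(names):
        file_digest = hashlib.sha256((root / name).read_bytes()).hexdigest()
        # Same line layout as sha256sum: digest, two spaces, file name.
        summary.update(f"{file_digest}  {name}\n".encode())
    return summary.hexdigest()
```

Because the input is the extracted tree, the result does not depend on tar member order, timestamps, or the compressor used for transport.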

1 more reply

bentley3y ago

misnome3y ago

But this isn’t a great solution, because afterwards there are now three or four source download links, only some of which are stable.

Not to mention, forcing people to use GitHub releases instead of just tags (which excludes every mirror of somewhere else)

mathstuf3y ago

I agree this would be great. However, it should also stop you from providing useless tarballs (as `/archive/` does today) if:

- you use autoconf (or any other tool(s) that require generating code into the source archive); or
- you have submodules (to which `git archive` is completely blind).

[1] https://github.com/roehling/git-archive-all

1 more reply

matthewcroughan3y ago

https://floxdev.com/blog/hash-collision

vtbassmatt3y ago

phphphphp3y ago

dharmab3y ago

Hyrum's Law:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

2 more replies

nickitolas3y ago

FWIW according to https://github.com/bazel-contrib/SIG-rules-authors/issues/11... a commitment was made, although in an exchange in some support ticket, and not in documentation.

VWWHFSfQ3y ago

At this point they'll be stuck on old git for all of eternity unless they just roll their own archive/compress step out of band so the old hashes still work. Yikes.

2 more replies

mdouglass3y ago

We are seeing an npm install failure inside our docker builds pointing at a github URL with a SHA change. Is this possibly related?

  #15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
  #0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
  #0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
  #0 8.164 npm ERR! code EINTEGRITY
  #0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
  #0 8.176 
  #0 8.177 npm ERR! A complete log of this run can be found in:
  #0 8.177 npm ERR!     /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
  #15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1

This was working earlier today and the docker build/package.json haven't changed.

andrewguenther3y ago

Yes, this is the exact issue being described

mdouglass3y ago

1 more reply

voidbip3y ago

Just want to second this. Still seeing an issue in our build right now that seems related.

kris-nova3y ago

Thanks for the update! There is only 1 internet to watch and learn from. We are all in this together. <3

1 more reply

denom3y ago

In my particular use-case, I'm using a set of local dev tools hosted as a homebrew tap.

The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula

lozenge3y ago

crote3y ago

The trick here is that a Github release is in essence simply a tag of a specific commit. There is no need to build archives in advance, as they can be dynamically generated from the git repo.

However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.

Denvercoder93y ago

> Does this mean the entire set of zip/tar.gz archives has been rebuilt?

They are probably generated on-demand (and cached) from the Git repository, not prebuilt.

scyrybdis3y ago

I think the zip/tar.gz archives are being created on the fly when you download them, probably with a caching layer in front.

tinus_hn3y ago

Pretty bizarre this ever was stable in the first place.

1 more reply

vlovich1233y ago

[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.

blueflow3y ago

But look at it from the other side. Users that don't read your documentation and expect your software to work like they imagined are just a huge pain in the ass.

vlovich1233y ago

blueflow3y ago

That's support you could expect if you paid for it.

2 more replies

ZephyrBlu3y ago

You just described >90% of users. Everyone does this for something, most people do it for most things.

You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

grepfru_it3y ago

>Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

This accurately describes my beef with golang

missingdays3y ago

Yes, but if you implement the checksum algorithm for GitHub archives, shouldn't you read the documentation about archives checksum?

1 more reply

dataflow3y ago

I don't think expecting users to go look for a user manual on each website whose links they download from is a realistic expectation.

blueflow3y ago

Worse, you can't expect other people to host your data for free, forever. If you want your data distributed, you need to check first if the platform is suitable for your purposes.

1 more reply

lupire3y ago

If you don't want users, feel free to ignore them.

throwawaylinux3y ago

If your product supports some particular behavior, it will be used regardless of what you document.

Microsoft was once renowned for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.

mr_toad3y ago

Give a man a fish and he’ll assume he’s entitled to a lifetime supply of free fish.

dataflow3y ago

kkirsche3y ago

ilyt3y ago

I'd imagine they broke their own stuff doing it, considering npm broke on it

KyeRussell3y ago

Do you work for Google?

hobofan3y ago

This isn't even a case of "we didn't document this".

nilsbunger3y ago

Denvercoder93y ago

The Git commit hashes didn't change (that'd actually be a serious problem). The hash of a compressed archive of the contents of a Git commit changed.

c4mpute3y ago

thirtyseven3y ago

nilsbunger3y ago

Oh interesting. But if an archive hash isn’t stable, how is it meant to be used? What’s it good for?

1 more reply

vlovich1233y ago

To be fair this isn't the git SHA. This is the generated archive (apparently dynamically per request) when you ask for a source tarball.

daniealapt3y ago

https://xkcd.com/1172/

sneak3y ago

It's Microsoft. Just as the Apple of today is not the Apple of ten years ago, the GitHub today is not the GitHub of ten years ago. It's literally different people.

The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.

There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.

naikrovek3y ago

The Microsoft of today isn't the Microsoft of 10 years ago, either, but that doesn't stop anyone from assuming that today's Microsoft is the same as the Microsoft of 10 years ago.

the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.

katbyte3y ago

Trust is lost quickly and easily and earned back slowly with great difficulty

1 more reply

lucb1e3y ago

slaymaker19073y ago

I once had a small issue with a deployment at work because of ordering issues within a zip file. That order is important with Spring since that determines which classes are initialized first.

groestl3y ago

One of the first things I check with every jvm packaging/deployment tool I investigate: does it preserve classpath ordering. Some offenders think -jar * is enough.

rfoo3y ago

> gpg seems to be what all of them use

GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.

Oh wait, what hash? :clown:

leoh3y ago

Many tools set mtime to zero to avoid checksum drift
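A sketch of what that looks like with Python's standard `tarfile` and `gzip` modules. `reproducible_tarball` is a hypothetical helper: it sorts the members and pins mtime, ownership, and mode in each tar header, and it also zeroes the timestamp in the gzip header.

```python
import gzip
import io
import tarfile


def reproducible_tarball(files):
    """Build a .tar.gz from a {name: bytes} mapping with all metadata pinned."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            payload = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            info.mtime = 0  # no wall-clock time in the header
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(payload))
    # mtime=0 keeps the gzip header free of the current time as well.
    return gzip.compress(raw.getvalue(), compresslevel=9, mtime=0)


one = reproducible_tarball({"b.txt": b"beta\n", "a.txt": b"alpha\n"})
two = reproducible_tarball({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
print(one == two)
```

This removes timestamp and ordering drift on one machine. The compressed bytes can still change if the underlying compression library changes, which is the failure mode this thread is about.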

philipwhiuk3y ago

frankjr3y ago

GitHub will need to revert this change. They've just crippled pretty much every "from source" package manager out there.

metrognome3y ago

Per the post, this was a change to git itself: https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d47...

forgotpwd163y ago

What was the thought behind this change?

georgyo3y ago

If you read the commit message you would see that it is to drop a third-party dependency.

1 more reply

fweimer3y ago

They could just produce tar output and compress that using system gzip. The “git archive” tool supports many output formats.

acdha3y ago

If those tools incorrectly assume an API contract which doesn't exist, isn't the right answer to fix those tools?

kentonv3y ago

In theory, sure, that's what we'd do in an ideal world.

This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.

groestl3y ago

kzrdude3y ago

I think everyone knows these files are generated on the fly, but it comes from old habits.

nick__m3y ago

I'd prefer that the tools be adapted to be more resilient and not depend on GitHub's particular implementation.

swarfield3y ago

Denvercoder93y ago

1 more reply

shakow3y ago

By checking the hash of the extracted files. The hash of the archive depends on the order in which the files were compressed, the compression, some metadata, etc.

1 more reply

ArchOversight3y ago

a git checkout of the code at that particular tag hasn't changed. Just the tarball that git archive generates has.

2 more replies

ErikCorry3y ago

This seems like a weak argument.

Firstly SHA is not a secure hash.

What am I missing?

4 more replies

Zababa3y ago

They're all waiting for your pull requests.

naikrovek3y ago

the change was to git, not GitHub.

nick__m3y ago

Sorry, I misread the GitHub announcement and interpreted it incorrectly.

pxc3y ago

Nixpkgs' so-called binary cache actually also caches source tarballs. Any Nix users out there who ran updates during the change?

Did cache hits save you? Did cache misses break your builds?

anderskaseorg3y ago

Nixpkgs’s fetchFromGitHub function hashes the contents of GitHub archives after unpacking, so it’s unaffected.

pxc3y ago

I should have remembered this! Nixpkgs committers are consistently mindful of things like this in code reviews.

clhodapp3y ago

I could be wrong, but I believe that Nix should be safe for the most part because it does a recursive hash of the stuff it cares about on the extraction of these archives.

jkachmar3y ago

didn’t realize this had happened until i logged off of my work computer & saw someone had shared this thread in a group chat.

looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).

WayToDoor3y ago

https://github.com/orgs/community/discussions/45830#discussi...

> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

skobovm3y ago

It's crazy how such a seemingly innocuous change, like this, could lead to such widespread loss in productivity across the globe.

misnome3y ago

wildfire3y ago

See https://github.com/orgs/community/discussions/45830 for the fallout.

kelnos3y ago

The thing I don't get is how this ever worked.

The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.

FeepingCreature3y ago

gzip is 28 years old. I don't think the output changes anymore.

account423y ago

ihattendorf3y ago

jzelinskie3y ago

Does anyone have the motivation for why the git project wants to use their own implementation of gzip? Did this implementation already exist and was being used for something else?

I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.

nemetroid3y ago

They're still using zlib to do the heavy lifting. It's not a large patch.

https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...

capableweb3y ago

> So the internal implementation takes 17% longer on the Linux repo, but

> uses 2% less CPU time. That's because the external gzip can run in

> parallel on its own processor, while the internal one works sequentially

> and avoids the inter-process communication overhead.

> What are the benefits? Only an internal sequential implementation can

> offer this eco mode, and it allows avoiding the gzip(1) requirement.

Twirrim3y ago

This was a change in the upstream git project, I don't think it came from GitHub necessarily?

https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....

pixl973y ago

Because they pay for the 2% CPU time, not for the 17% local time. In theory the user also pays for 2% less CPU time, but they are much less likely to be CPU limited in their build processes.

Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?

jeffbee3y ago

It seems like if they really wanted to save CPU they'd be caching the outputs. I fail to see why they would be recompressing years-old release tags. This seems like optimization at the wrong level.

That's without even mentioning the absurdity of saving 2% CPU but still using zlib.

semiquaver3y ago

“Their own” implementation is just zlib, already in use throughout git since the dawn of the project for other purposes like blob storage [1].

Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.

[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

groestl3y ago

I think "Drop the dependency on gzip" for something like Git trumps a bit more exposure (which can be mitigated with thorough reviews).

Aissen3y ago

https://bugzilla.tianocore.org/show_bug.cgi?id=3099

https://github.com/isaacs/github/issues/1483

I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.

doubleunplussed3y ago

Ah, this will presumably break some Arch Linux AUR packages. Preparing for bug reports.

elesiuta3y ago

[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...

frankjr3y ago

Yep, it has already broken labwc for me.

    ==> Validating source files with b2sums...
        labwc-0.6.1.tar.gz ... FAILED
    ==> ERROR: One or more files did not pass the validity check!

lopkeny12ko3y ago

I can't fathom how no one internally at Microsoft-Github realized how widespread the breakage would be before rolling this out to all public users.

Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?

ilyt3y ago

I can

"didn't read every commit in new version of git, realized after the fact"

medellin3y ago

Im thinking of all the bazel build rules that are about to break from my last company. Someone will have a fun day updating hundreds of hashes.

ErikCorry3y ago

Do they let Github generate the archives as one of the build rules instead of performing the archival and compression locally and uploading the result?

medellin3y ago

jart3y ago

If they're using multiple URLs like a good Bazel user then they shouldn't be impacted.

thirtyseven3y ago

The setup instructions for almost [1] every [2] major [3] rule set [4] only provide one (GitHub) url in the Starlark blob you're supposed to copy and paste, so hard to blame users here.

[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...

[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...

[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0

[4] https://github.com/bazelbuild/rules_scala

jart3y ago

I agree. The Bazel developers failed in their leadership.

1 more reply

medellin3y ago

They did where applicable but i know that not all of them had multiple

jart3y ago

Well now they know why it's so important. https://github.com/bazelbuild/bazel/commit/ed7ced0018dc5c5eb...

UncleOxidant3y ago

hamandcheese3y ago

The fact that this is causing problems seems like a flaw in Bazel, imo. Nix, for example, calculates a hash of the contents of a tarball, rather than a hash of the tarball itself.

rfoo3y ago

Yep, Nix not affected at all is pretty impressive.

On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.

Foxboron3y ago

They don't really do any source authentication at all. There is no strategy for checking gpg/minisign/whatever signatures and fetching keys to validate these things.

ArchOversight3y ago

I remember a similar breakage happening before due to internal git changes, and thought it was common knowledge to upload your own signed tarballs for releases.

rektide3y ago

Now please give us compression options beyond gzip? :) Some zstd & lz4 please?

bentley3y ago

We’ve regressed from the previous norm of open source projects providing stable source tarballs with fixed checksums, sometimes even with cryptographic signatures.

reindeerer3y ago

That norm still exists, and GitHub offers it in the form of the GitHub Releases feature as well.

It's the downstream tooling (i.e. all the builds and package managers) that needs to clean up its act.

JonChesterfield3y ago

If the source tar changes, how do you propose the downstream tooling distinguishes between data corruption, MITM attack and upstream deciding to change the number without notifying anyone?

rswail3y ago

This is being driven in industry by the push by US FedGov (via NIST) to have supply chain verification after the recent hacks.

POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as SPDX: https://tools.spdx.org/app/about/

metrognome3y ago

Oh, I'm not arguing that using checksums, SHA for example, for integrity verification is a bad idea. That's what they're designed for, after all.

swarfield3y ago

https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

1letterunixname3y ago

Forever problem 0:

Tar/zipball archives on the same ref never have a stable hash.

Forever problem 1:

No sha256/512/3 hashes of said tar/zipballs.

Forever problem 2:

No metalinks for those.

Forever problem 3:

Not IPv6. Some of our network is IPv6 only.

Forever problem 4:

Hitting secondary rate limiting because I can browse fast.

fomine33y ago

I wasn't aware that git archive is reproducible

pabs33y ago

I note that diffoscope is useful for verifying which parts of git/other archives have changed:

https://diffoscope.org/

You can try it online here:

https://try.diffoscope.org/

swarfield3y ago

They have broken almost every open source project that builds external deps. Also broke homebrew apparently.

capableweb3y ago

Good test that the tooling actually works when the checksums are incorrect :) If your "build from source" tool/workflow DIDN'T break, I'd be worried.

groestl3y ago

> every open source project that builds external deps

and relies on checksumming ephemeral artefacts for integrity.

catiopatio3y ago

Source archives have never, in the entire history of open source, been considered ephemeral.

GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.

mardifoufs3y ago

I think this change only affects automatically (and dynamically) generated source archives, not those that are actually pushed to Github Releases beforehand.

kzrdude3y ago

All these projects relying on GitHub are using a free service they don't control. It could go away someday. That would be a bigger crisis than this was.

pxc3y ago

Such tools should definitely checksum package sources lol

robomc3y ago

I think this also broke GitHub Codespaces (the downloading of devcontainer "features").

jakeogh3y ago

GitHub support, please check out: https://news.ycombinator.com/item?id=34606345

philipwhiuk3y ago

Yet another reason why GitHub is not a good Artifactory/Nexus replacement.

Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?

naikrovek3y ago

this is a git behavior, not a GitHub behavior.

files uploaded to GH Packages are not modified by GitHub.

only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.

if you upload a package to GH Packages or upload a release asset to a GitHub release, those are never modified, and you can rely on those hashes.
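to illustrate why an on-demand archive can change its hash while the source stays the same, here is a small sketch. it uses two gzip compression levels as a stand-in for "the compressor changed"; it is not a reproduction of what git itself changed, and the payload is a made-up placeholder for a tar stream.

```python
import gzip
import hashlib

# Placeholder for the tar stream of a tagged source tree.
payload = b"int main(void) { return 0; }\n" * 2000

# Same input, two compressor settings; mtime=0 keeps the gzip header timestamp fixed.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

archive_hashes_match = hashlib.sha256(fast).digest() == hashlib.sha256(best).digest()
contents_match = gzip.decompress(fast) == gzip.decompress(best) == payload

print("archive hashes match:", archive_hashes_match)
print("contents match:", contents_match)
```

an uploaded release asset avoids this because the compressed bytes are produced once and then stored.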

philipwhiuk3y ago

No, it's not.

naikrovek3y ago

> The problem is they also presented it as if it was a stable reference.

how? the docs state that the hashes of these files are not guaranteed to be stable.

the decision to generate those files on demand is a good one, provided that the behavior is documented, and it is.

others in this thread figured it out before this particular issue arose and made the necessary changes to their workflows so that their downloads would have stable hashes.

blcknight3y ago

Oh god I spent like an hour debugging why gpg wouldn’t recognize the signature of RVM (Ruby version manager)

forgotpwd163y ago

Can anyone explain what happened? Thing changed, things broke, and things changed back in less than an hour.

zoobab3y ago

Github devs cannot point to their git commit, because Github is not open source.

yakubin3y ago

Keep it simple, just vendor your deps.

DoctorNick3y ago

With what? The abomination that is `git submodules`?

yakubin3y ago

How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.

rabexc233y ago

Vendoring, even with a tool, has always worked poorly for me. Here are a few reasons:

3. For anything but small/tiny projects, the vendoring will take up most of the download / checkout time of your repository.

If you use git for vendoring, the problem is not significantly better: if you care about the integrity of the vendored code, you need to verify the final tree, or the log / hash / set of commits.

Compare to using a simple file with a 1) url, 2) secure hash, 3) list of patches to apply. Reviewing and ensuring correctness is trivial, upgrading is trivial, PRs are trivial.
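A sketch of what one entry in such a file could look like once loaded, assuming a made-up `PinnedSource` record; the URL, digest and patch name below are placeholders, not a real project.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedSource:
    """One dependency pin: where to fetch it, what it must hash to, what to patch."""

    url: str
    sha256: str
    patches: tuple[str, ...] = ()

    def verify(self, data: bytes) -> None:
        """Raise ValueError unless the downloaded bytes match the pinned hash."""
        actual = hashlib.sha256(data).hexdigest()
        if actual != self.sha256:
            raise ValueError(f"{self.url}: expected sha256 {self.sha256}, got {actual}")


# Hypothetical pin; a reviewer only has to look at these three fields.
example_pin = PinnedSource(
    url="https://example.com/libfoo-1.2.3.tar.gz",
    sha256=hashlib.sha256(b"placeholder archive bytes").hexdigest(),
    patches=("0001-fix-build.patch",),
)
```

An upgrade is then a three-field diff: new URL, new hash, and whichever patches still apply.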

SuperSandro20003y ago

That's why Nix unpacks the archives first and then hashes them.

gray_-_wolf3y ago

Did people not know this? Honest question. I had already run into this a few times before this change, so I assumed it was widespread knowledge and mirrored everything.

anecdotal13y ago

They have not been stable

https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...

mhitza3y ago

https://xkcd.com/1053/

daniealapt3y ago

Any change breaks a workflow - https://xkcd.com/1172/
