For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
Well, the norm has been that maintainers generated and distributed a source archive, and that the archive, once published, was immutable. That workflow is still perfectly fine with GitHub and is not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
I always thought the zip archives from this feature were generated on the fly, maybe cached, because I wouldn't expect GitHub to store an archive for every commit of every repository.
I'm actually surprised that so many important projects rely on this feature producing stable output, and that the output was ever actually stable.
That sounds like speculation. Just as a test, I cloned the git repo, which took 29 seconds, then took its hash with `guix hash`, which took 0.387s.
I think that if you can't handle a 0.4s delay in a build, you have bigger problems.
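Hashing the checkout itself, rather than any archive of it, is what makes this robust: the digest depends only on paths and contents, not on an archive format. A minimal sketch of the idea in Python (real tools like `guix hash` serialize a richer form that also covers permissions and symlinks; the function here is illustrative):

```python
import hashlib
import os

def tree_hash(root):
    """Hash a directory deterministically: file paths and contents
    in sorted order, so the result is independent of how (or whether)
    the tree was ever packed into an archive. Simplified sketch; a
    real nar-style serialization also covers modes and symlinks."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order of subdirectories
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h.update(rel.encode() + b"\0")  # separate path from data
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")
    return h.hexdigest()
```

Two checkouts with identical contents hash identically no matter what order the files were written in, which is exactly the property the generated tarballs turned out not to have.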
A) How do you catch tarballs that have extra files injected that aren't part of your manifest?
B) What does the performance of this look like? Certainly on traditional HDDs this is going to kill performance, but even on SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.
B would just be a normal git checkout, which already validates that all the objects are reachable. Git tags (and commits, for that matter) can be signed, and since the SHA-1 hash is signed as well, it validates that the entire tree of commits has not been tampered with. So as long as you trust git not to lie about what it is writing to disk, you have a valid checkout of that tag.
And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?
The other method would be having a manifest file with a checksum of every file inside the tar and comparing in flight: a simple "read from tar, compare to hash, write to disk" loop (maybe with some temp files for the bigger ones).
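That in-flight approach can be sketched with Python's `tarfile`. The manifest format here (a dict of member name to SHA-256 hex digest) is a made-up stand-in; note that rejecting any member not listed in the manifest also catches injected extra files:

```python
import hashlib
import tarfile

def verify_tar(tar_path, manifest):
    """Stream-verify a tarball against a manifest of expected SHA-256
    digests, without extracting to disk first. Raises ValueError on
    injected, missing, or tampered files. `manifest` maps member
    name -> hex digest (hypothetical format)."""
    seen = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if member.name not in manifest:
                raise ValueError(f"unexpected file in archive: {member.name}")
            h = hashlib.sha256()
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)  # hash while reading, no temp copy
            if h.hexdigest() != manifest[member.name]:
                raise ValueError(f"checksum mismatch: {member.name}")
            seen.add(member.name)
    missing = set(manifest) - seen
    if missing:
        raise ValueError(f"files listed in manifest but absent: {missing}")
```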
Firstly, SHA is not a secure hash.
Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, then it's not really a reproducible build. For all you know, GitHub inserts a virus during the compression of the archive.
What am I missing?
It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and its derivatives were absolutely intended to provide secure collision resistance.)
I think you're mixing things up here. GitHub didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) `git archive` output. So your tarballs contain the same unpacked data but have different hashes under all algorithms, secure or not.
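The distinction is easy to demonstrate: re-wrapping identical bytes changes every archive-level checksum while the unpacked content is untouched. A small illustration, using a gzip header variation as a stand-in for GitHub's compression change:

```python
import gzip
import hashlib

data = b"identical source tree contents\n"

# The same payload wrapped two ways (different gzip header timestamps
# stand in here for a change of compression implementation):
a = gzip.compress(data, mtime=0)
b = gzip.compress(data, mtime=1)

# The archive checksums differ under every hash algorithm...
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()

# ...while the unpacked contents are bit-for-bit identical.
assert gzip.decompress(a) == gzip.decompress(b) == data
```

Any pin taken over the archive bytes breaks the moment the wrapper changes, even though nothing about the source did.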
> Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
Indeed. So you take and record a SHA-256 of the archive file you are tagging, such that no one can feasibly do that!
Again, what happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It has nothing to do with security or cryptography.
[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.
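Pinning the archive bytes at tag time is just a file digest; a minimal sketch (the helper name is mine):

```python
import hashlib

def sha256_file(path):
    """Digest a release artifact so any later modification -- by the
    host or in transit -- is detectable. Pin the returned hex digest
    alongside the download URL."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)  # stream in 1 MiB chunks; works for any size
    return h.hexdigest()
```

With the digest recorded, a regenerated-but-different archive fails verification instead of silently slipping through; it just doesn't help when the link was never immutable to begin with.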
This is incorrect, but even if it were true, you could use whatever hash you prefer instead. Gentoo, for example, can use any hash you like, such as BLAKE2, and the default Gentoo repo captures both the SHA-512 and BLAKE2 digests in the Manifest.
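Recording several independent digests for one distfile is a one-liner per algorithm with `hashlib`; the sketch below is loosely modeled on a Gentoo Manifest `DIST` line, with the field layout simplified and approximate:

```python
import hashlib

def manifest_entry(name, data):
    """Record multiple independent digests for one distfile, loosely in
    the spirit of a Gentoo Manifest line (format simplified here). An
    attacker would have to collide both algorithms simultaneously."""
    return "DIST {} {} BLAKE2B {} SHA512 {}".format(
        name,
        len(data),
        hashlib.blake2b(data).hexdigest(),
        hashlib.sha512(data).hexdigest(),
    )
```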
SHA-1 is still used for security purposes anyway, even though it really shouldn't be!
Signing git commits still relies on SHA-1 for security, which I think many people don't realize.
Commit signing only signs the commit object itself; other objects such as trees, blobs, and tags are not directly involved in the signature. The commit object contains the SHA-1 hashes of its parents and of a root tree. Since trees contain hashes of all of their items, this creates a recursive chain of hashes over the entire contents of the repo at that point in time!
So signed commits rely entirely on the security of SHA-1 for now!
You may have already known all of this about git signing, but I thought it might be interesting to mention.
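That hash chain is easy to see from how git derives object IDs: SHA-1 over a `<type> <size>\0` header plus the raw payload (this header scheme is git's actual construction; the well-known empty-blob ID makes a handy check):

```python
import hashlib

def git_object_id(obj_type, payload):
    """Compute a git object ID the way git does: SHA-1 over a
    "<type> <size>\\0" header followed by the raw payload. Commits
    embed the IDs of their root tree and parents, and trees embed the
    IDs of every blob and subtree, so signing one commit ID
    transitively pins the whole snapshot -- with SHA-1's strength as
    the limit."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

# git's well-known ID for the empty blob:
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```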
> 2) The checksum assures you that the file you have is the same your upstream looked at
2) If I and the upstream are both looking at a file that was generated by GitHub, then the SHA may match, but that doesn't prove we weren't both owned by GitHub.
Perhaps what I'm missing is that this isn't part of a reproducible-build scenario: there's no attempt to ensure that the file GitHub built is the one I would build from the same starting point.