For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
Well, the norm has been that maintainers generated and distributed a source archive, and that the archive, once published, was immutable. That workflow is still perfectly fine with GitHub and is not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
I always thought the zip archives from this feature were generated on the fly, maybe cached, because I wouldn't expect GitHub to store an archive for every commit of every repository.
I'm actually surprised that so many important projects rely on this feature producing stable output, and that the output was ever actually stable.
That sounds like speculation. Just as a test, I cloned the git repo, which took 29 seconds, then took its hash with `guix hash`, which took 0.387s.
I think that if you can't handle a 0.4s delay in a build, you have bigger problems.
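Hashing the checkout itself, rather than any archive of it, is what makes this robust: the digest depends only on paths and contents, not on an archive format. A minimal sketch of the idea in Python (real tools like `guix hash` serialize a richer form that also covers permissions and symlinks; the function here is illustrative):

```python
import hashlib
import os

def tree_hash(root):
    """Hash a directory deterministically: file paths and contents
    in sorted order, so the result is independent of how (or whether)
    the tree was ever packed into an archive. Simplified sketch; a
    real nar-style serialization also covers modes and symlinks."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order of subdirectories
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h.update(rel.encode() + b"\0")  # separate path from data
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")
    return h.hexdigest()
```

Two checkouts with identical contents hash identically no matter what order the files were written in, which is exactly the property the generated tarballs turned out not to have.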
A) How do you catch tarballs that have extra files injected that aren't part of your manifest?
B) What does the performance of this look like? Certainly on traditional HDDs this is going to kill performance, but even on SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.
B would just be a normal git checkout, which already validates that all the objects are reachable. Git tags (and commits, for that matter) can be signed, and since the SHA-1 hash is signed as well, it validates that the entire tree of commits has not been tampered with. So as long as you trust git not to lie about what it is writing to disk, you have a valid checkout of that tag.
And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?
The other method would be having a manifest file with a checksum of every file inside the tar and comparing in flight: a simple "read from tar, compare to hash, write to disk" loop (maybe with some temp files for the bigger ones).
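That in-flight approach can be sketched with Python's `tarfile`. The manifest format here (a dict of member name to SHA-256 hex digest) is a made-up stand-in; note that rejecting any member not listed in the manifest also catches injected extra files:

```python
import hashlib
import tarfile

def verify_tar(tar_path, manifest):
    """Stream-verify a tarball against a manifest of expected SHA-256
    digests, without extracting to disk first. Raises ValueError on
    injected, missing, or tampered files. `manifest` maps member
    name -> hex digest (hypothetical format)."""
    seen = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if member.name not in manifest:
                raise ValueError(f"unexpected file in archive: {member.name}")
            h = hashlib.sha256()
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)  # hash while reading, no temp copy
            if h.hexdigest() != manifest[member.name]:
                raise ValueError(f"checksum mismatch: {member.name}")
            seen.add(member.name)
    missing = set(manifest) - seen
    if missing:
        raise ValueError(f"files listed in manifest but absent: {missing}")
```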
Firstly, SHA is not a secure hash.
Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, then it's not really a reproducible build. For all you know, GitHub inserts a virus during the compression of the archive.
What am I missing?
It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and its derivatives were absolutely intended to provide secure collision resistance.)
I think you're mixing things up here. GitHub didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) `git archive` output. So your tarballs contain the same unpacked data but have different hashes under all algorithms, secure or not.
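The distinction is easy to demonstrate: re-wrapping identical bytes changes every archive-level checksum while the unpacked content is untouched. A small illustration, using a gzip header variation as a stand-in for GitHub's compression change:

```python
import gzip
import hashlib

data = b"identical source tree contents\n"

# The same payload wrapped two ways (different gzip header timestamps
# stand in here for a change of compression implementation):
a = gzip.compress(data, mtime=0)
b = gzip.compress(data, mtime=1)

# The archive checksums differ under every hash algorithm...
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()

# ...while the unpacked contents are bit-for-bit identical.
assert gzip.decompress(a) == gzip.decompress(b) == data
```

Any pin taken over the archive bytes breaks the moment the wrapper changes, even though nothing about the source did.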
> Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
Indeed. So you take and record a SHA-256 of the archive file you are tagging, such that no one can feasibly do that!
Again, what happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It has nothing to do with security or cryptography.
[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.
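Pinning the archive bytes at tag time is just a file digest; a minimal sketch (the helper name is mine):

```python
import hashlib

def sha256_file(path):
    """Digest a release artifact so any later modification -- by the
    host or in transit -- is detectable. Pin the returned hex digest
    alongside the download URL."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)  # stream in 1 MiB chunks; works for any size
    return h.hexdigest()
```

With the digest recorded, a regenerated-but-different archive fails verification instead of silently slipping through; it just doesn't help when the link was never immutable to begin with.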
This is incorrect, but even if it were true, you could use whatever hash you prefer instead. Gentoo, for example, can use any hash you like, such as BLAKE2, and the default Gentoo repo captures both the SHA-512 and BLAKE2 digests in the Manifest.
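Recording several independent digests for one distfile is a one-liner per algorithm with `hashlib`; the sketch below is loosely modeled on a Gentoo Manifest `DIST` line, with the field layout simplified and approximate:

```python
import hashlib

def manifest_entry(name, data):
    """Record multiple independent digests for one distfile, loosely in
    the spirit of a Gentoo Manifest line (format simplified here). An
    attacker would have to collide both algorithms simultaneously."""
    return "DIST {} {} BLAKE2B {} SHA512 {}".format(
        name,
        len(data),
        hashlib.blake2b(data).hexdigest(),
        hashlib.sha512(data).hexdigest(),
    )
```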
SHA-1 is still used for security purposes anyway, even though it really shouldn't be!
Signing git commits still relies on SHA-1 for security, which I think many people don't realize.
Commit signing only signs the commit object itself; other objects such as trees, blobs, and tags are not directly involved in the signature. The commit object contains the SHA-1 hashes of its parents and of a root tree. Since trees contain hashes of all of their items, this creates a recursive chain of hashes over the entire contents of the repo at that point in time!
So signed commits rely entirely on the security of SHA-1 for now!
You may have already known all of this about git signing, but I thought it might be interesting to mention.
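That hash chain is easy to see from how git derives object IDs: SHA-1 over a `<type> <size>\0` header plus the raw payload (this header scheme is git's actual construction; the well-known empty-blob ID makes a handy check):

```python
import hashlib

def git_object_id(obj_type, payload):
    """Compute a git object ID the way git does: SHA-1 over a
    "<type> <size>\\0" header followed by the raw payload. Commits
    embed the IDs of their root tree and parents, and trees embed the
    IDs of every blob and subtree, so signing one commit ID
    transitively pins the whole snapshot -- with SHA-1's strength as
    the limit."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

# git's well-known ID for the empty blob:
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```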
> 2) The checksum assures you that the file you have is the same your upstream looked at
2) If I and the upstream are both looking at a file that was generated by GitHub, then the SHA may match, but that doesn't prove we weren't both owned by GitHub.
Perhaps what I'm missing is that this isn't part of a reproducible-build scenario: there's no attempt to ensure that the file GitHub built is the one I would build from the same starting point.