
Cthulhu_ 5y ago

SHA1 is close to being broken, but it's not there yet, and Git will be migrating to a better algorithm.

That said, if you could rewrite an older commit, the change would only be applied in a fresh clone, right?

db48x 5y ago

Even if you could break SHA1, it's unlikely that your replacement source code would look like it was human-written. Instead, it's going to look like human-written source code containing kilobytes or megabytes of random-looking comments. The comments will only be there to change the hash of the new content back to the hash of the original content. It's not going to be subtle at all.

flingo 5y ago

Why would it require that much data? I always thought you wouldn't need to add or change more bytes than are in the output.

Also, git hashes aren't just based on source code. You can add that data anywhere that git uses to generate the hash.

db48x 5y ago

That's true of a CRC code, but hashes are a lot harder to break.
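One way to see the difference, as a small Python sketch: CRC-32 is linear over XOR (strictly speaking, affine), so for equal-length messages the checksum of an XOR combination can be predicted from the individual checksums. That structure is what makes steering a CRC to a chosen value cheap. SHA-1 has no such shortcut. The helper names and sample messages below are made up for illustration.

```python
import hashlib
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR several equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        assert len(chunk) == len(out), "all inputs must have the same length"
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def sha1_int(data: bytes) -> int:
    """SHA-1 digest as an integer, so digests can be XORed."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


# Three arbitrary 16-byte messages (illustrative only).
a, b, c = b"original code!!!", b"malicious code!!", b"padding bytes..."
mixed = xor_bytes(a, b, c)

# For CRC-32 the checksum of the mix is the XOR of the three checksums.
crc_predictable = zlib.crc32(mixed) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

# The same relation does not hold for SHA-1.
sha1_predictable = sha1_int(mixed) == sha1_int(a) ^ sha1_int(b) ^ sha1_int(c)

print("CRC-32 predictable:", crc_predictable)
print("SHA-1 predictable:", sha1_predictable)
```

The relation only works for an odd number of equal-length messages, because the length-dependent constant in the CRC has to survive the XOR.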

Git hashes each file, and puts those hashes into a tree object, like a directory listing. Then it hashes the trees, recursively back up to the root of the repository. Finally the hash of the root tree is put in the commit object, and the commit object is hashed. Thus the two places you can put additional data to be hashed are the file contents (either in existing files or new files), or in the commit message. You can get a few free bits by adjusting less obvious things like the commit timestamp or the author's email address, but not nearly enough to make your forged commit have the same hash as an existing commit.
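The chain described above can be sketched in a few lines of Python. This is a simplified model, not Git's own code: it assumes a flat tree with no subdirectories (real Git orders tree entries slightly differently when directories are involved), and the function names and the sample author line are invented for the example.

```python
from __future__ import annotations

import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over the header '<kind> <size>' + NUL, followed by the raw body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    """Object id of a file's contents."""
    return git_object_id("blob", content)


def tree_id(entries: dict[str, tuple[str, str]]) -> str:
    """entries maps file name -> (mode, hex object id). Flat trees only."""
    body = b""
    for name in sorted(entries):
        mode, oid = entries[name]
        # Each entry is '<mode> <name>' + NUL + the 20 raw bytes of the id.
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return git_object_id("tree", body)


def commit_id(tree: str, parents: list[str], author: str, message: str) -> str:
    """Commit id covers the root tree id, parent ids, author line and message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message]
    return git_object_id("commit", "\n".join(lines).encode())


# Hypothetical author line and file contents, for illustration.
stamp = "A U Thor <author@example.com> 1600000000 +0000"
root_a = tree_id({"README": ("100644", blob_id(b"hello\n"))})
root_b = tree_id({"README": ("100644", blob_id(b"hello!\n"))})
commit_a = commit_id(root_a, [], stamp, "initial\n")
commit_b = commit_id(root_b, [], stamp, "initial\n")
print(commit_a != commit_b)  # a one-byte file change reaches the commit id
```

Every input to `commit_id` is a place where a forger could vary bytes, which is why the file contents and the commit message are the only fields with real room in them.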


kibwen 5y ago

The git hash surely also takes the contents of binary files into account, so I imagine that in any repo that contains non-text files, an attacker would try to hide the garbage inside e.g. some metadata field of an image file.

db48x 5y ago

That's true. PDFs and other document formats are also great because you can include large volumes of data that is never used in the final output.

tomxor 5y ago

> That said, if you could rewrite an older commit, the change would only be applied in a fresh clone, right?

I think so, assuming the fetch algorithm uses the hashes to work out which deltas to get, which I think it does.
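A toy model of why that would be, under the same assumption that a fetch only transfers objects whose ids the receiver lacks. `ObjectStore` and its `fetch` method are hypothetical names for illustration, not Git's actual negotiation protocol, and the shared id is simply faked rather than being a real collision.

```python
class ObjectStore:
    """Minimal content-addressed store: object id -> bytes."""

    def __init__(self):
        self.objects = {}

    def fetch(self, remote: "ObjectStore") -> int:
        """Copy only objects whose ids are missing locally; return the count."""
        copied = 0
        for oid, data in remote.objects.items():
            if oid not in self.objects:
                self.objects[oid] = data
                copied += 1
        return copied


FAKE_ID = "abc123"  # stands in for a SHA-1 shared by two different contents

server = ObjectStore()
server.objects[FAKE_ID] = b"original content"

old_clone = ObjectStore()
old_clone.fetch(server)

# The attacker swaps the content on the server but keeps the same id.
server.objects[FAKE_ID] = b"forged content"

copied_again = old_clone.fetch(server)  # id already present, nothing copied
fresh_clone = ObjectStore()
fresh_clone.fetch(server)

print(old_clone.objects[FAKE_ID])    # still the original bytes
print(fresh_clone.objects[FAKE_ID])  # the forged bytes
```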

I'm not sure about CVS, but with Git, rewriting a _previous_ commit _object_ so that it points at different blobs, while keeping the commit's hash the _same_ by messing with its message, wouldn't make any difference to child commits. Commits are pretty much independent apart from the parent pointers, which are incorporated into each commit's hash (i.e. the children would have their own trees, so the changes would not propagate to the HEAD of the branch).

I think the only way to have something end up in the HEAD of a branch AND persist is to break the SHA1 of a blob (i.e. a file) by inserting the extra SHA1-breaking content into the blob itself rather than into a commit or tree (provided that exact blob hash is part of the tree in the HEAD of a branch). Then you would also need to hope that the malicious blob is fetched by the person who writes the next commit based on the HEAD of that branch AND that they modify the same file, so that it persists into the next revision of the blob... seems pretty hard to pull off - pun intended

There is also the issue of pushing a blob that, according to its hash, already exists on the remote. Even with rewrite permission, GC might make that hard to do quickly... I wonder if you would need direct access to the git server to do this.

[EDIT]

Thinking about swapping out SHA1 in the future: you would still want to rehash all of the existing blobs and trees. Otherwise old blobs that never change again would stay open to SHA1 attacks, which is essentially what I described above.

If you only hashed new blobs with the new algorithm you would need to wait until every file had been touched to be safe.
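As a rough sketch of what rehashing means here: the same object bytes get a second id under the new algorithm, and a table maps old ids to new ones. This assumes the replacement is SHA-256 computed over the same `<type> <size>` header as the SHA-1 ids, and it only covers blobs; trees and commits embed other ids, so they would have to be rewritten bottom-up. `rehash_all` is a hypothetical helper, not a Git command.

```python
from __future__ import annotations

import hashlib


def object_id(algo: str, kind: str, body: bytes) -> str:
    """Hash '<kind> <size>' + NUL + body with the named hashlib algorithm."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.new(algo, header + body).hexdigest()


def rehash_all(blobs: list[bytes]) -> dict[str, str]:
    """Map each blob's SHA-1 id to its id under SHA-256 (assumed new algorithm)."""
    return {
        object_id("sha1", "blob", blob): object_id("sha256", "blob", blob)
        for blob in blobs
    }


# Old, untouched blobs get new ids too, not just files changed after the switch.
mapping = rehash_all([b"", b"print('hi')\n"])
for old_id, new_id in mapping.items():
    print(old_id, "->", new_id)
```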

eru 5y ago

Yes, I would assume that most git repositories would want to re-hash all old commits when SHA1 gets replaced.

For backwards compatibility, I suspect we'll add the new hash and keep SHA1 around, unless you specifically disable SHA1.
