The current transition plan is being discussed here: https://public-inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a...
Kudos to brian m. carlson for convincing Linus to use SHA3-256 over SHA-256. This is really the only sane option we have.
I don't expect anything horrible, but still curious.
EDIT: After skimming OP I found a few answers.
The message from the Keccak Team [1] is especially interesting. The summary is that we don't have to worry about performance degradation from the hash calculation itself: there is a palette of functions that are considered to have a "security level [...] appropriate for your application" and are considerably faster than SHA-1.
[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...
I proposed the idea of improved compile-time checking and maintainability, as there wasn't originally much interest in a new hash function, but the maintainability improvements were something people could go for.
I hadn't spent as much time working on it as I am now, so it moved slowly. Other people also helped by converting parts of the code that they were working on (like parts of the refs subsystem).
This might be a non-issue based on how Git stores the tree, but I can imagine one simple model where each directory would be a sort of "collection object", a binary encoding of a list of (filename, hash) pairs in filename order, and therefore the directory gets a hash of its own. But that means that when you're communicating with a SHA-1 repository you don't just need to rename this object; its contents also need to be changed pre-rename, and then you need to store every internal node twice. I'm not seeing that in your summary.
Is it just that Git doesn't have any internal nodes in the directory tree per se because the "filename" is a full POSIX path with subdirs? Or what?
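To answer my own question partly: Git trees do have internal nodes. Each directory is its own tree object, a binary list of (mode, name, hash) entries whose subdirectory entries point at sub-tree hashes, so bridging to a SHA-1 repository really would mean rewriting tree contents recursively. A minimal sketch of how a tree object's hash is computed (the sort is simplified; real Git sorts directory names as if suffixed with '/'):

```python
import hashlib

def git_object_hash(obj_type, body):
    # Git hashes "<type> <size>\0<body>"; SHA-1 in current repositories.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_hash(entries):
    # entries: (mode, name, raw_digest) tuples; mode "40000" entries point
    # at sub-tree digests, so changing the hash function rewrites every tree.
    body = b"".join(
        f"{mode} {name}".encode() + b"\x00" + raw
        for mode, name, raw in sorted(entries, key=lambda e: e[1])
    )
    return git_object_hash("tree", body)
```

As a sanity check, the empty tree comes out as the well-known `4b825dc642cb6eb9a060e54bf8d69288fbee4904`.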
Wouldn't fetching from a SHA-1 repository degrade security? I think it would be better to show a warning (similar to what OpenSSH does with 1024-bit DSA keys) every time you fetch from a SHA-1 Git repo. Same for pushing a signed commit to a SHA-1 repository.
Now that we have SHA-3, we ended up with a gazillion Keccak variants and Keccak-likes. The authors of Keccak have suggested that Git may instead want to consider e.g. SHAKE128. [0]
[0]: https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...
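For the curious: SHAKE128 is an extendable-output function, so the caller picks the digest length, and a 32-byte output targets a 128-bit security level. A quick illustration with Python's hashlib:

```python
import hashlib

# SHAKE128 lets the caller choose the output length; shorter outputs are
# literal prefixes of longer ones for the same input.
xof = hashlib.shake_128(b"some git object bytes")
short_id = xof.hexdigest(8)    # 8-byte prefix, e.g. for abbreviated ids
full_id = xof.hexdigest(32)    # 32-byte digest, 128-bit security target
```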
It's a bit unfortunate that this is really a cryptographic choice, and it seems to mostly be made by non-cryptographers. Furthermore, the people making that choice seem to be deeply unhappy about having to make it.
This makes me unhappy, because I wish making cryptographic choices got much easier over time, not harder. While SHA-2 was the most recent SHA, picking the correct hash function was easy: SHA-2. Sure, people built broken constructions (like prefix-MAC or whatever) with SHA-2, but that was just SHA-2 being abused, not SHA-2 being weak.
A lot of those footguns are removed with SHA-3, so I guess safe crypto choices are getting easier to make. On the other hand, the "obvious" choice, being made by the aforementioned unhappy maintainers, is slow in a way that probably matters for some use cases. Moreover, not even the designers think it's an obvious choice; I think most cryptographers don't consider it the best tool we have, and it's a design we're less sure how to parametrize. There are easy and safe ways to parametrize SHA-3 to fix flaws like Fossil's artifact confusion, but BLAKE2b's are faster and more obvious. And it's slow. Somehow, I can't be terribly pleased with that.
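To make the "artifact confusion" point concrete: the fix is domain separation, i.e. binding the object type into the hash itself. BLAKE2b exposes this directly via its personalization parameter; a sketch (this is not anything Git or Fossil actually does, just an illustration of the primitive):

```python
import hashlib

# Domain separation via BLAKE2b's `person` parameter: the same bytes
# hashed as a "blob" and as a "commit" produce unrelated digests.
def typed_digest(obj_type: bytes, data: bytes) -> str:
    return hashlib.blake2b(
        data, digest_size=32, person=obj_type.ljust(16, b"\x00")
    ).hexdigest()
```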
Slightly similar: for a while I've wanted to recreate just enough of git's functionality to commit and push to GitHub. My guess is the commit part would be pretty trivial (as git's object and tree model is so simple) but the push/network/remote part a bunch harder.
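The commit part really is small: a commit is just a header naming the tree, parents, author, and committer, followed by the message, hashed like any other object. A rough sketch (field layout per Git's object format; the function name and identity string are illustrative):

```python
import hashlib

def commit_object(tree_hex, parent_hex, ident, message):
    # ident like "Jane Doe <jane@example.com> 0 +0000" (illustrative).
    lines = [f"tree {tree_hex}"]
    if parent_hex:
        lines.append(f"parent {parent_hex}")
    lines += [f"author {ident}", f"committer {ident}", "", message]
    body = "\n".join(lines).encode()
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()
```

The push side is indeed where the real work is: speaking the smart protocol and building a pack file, rather than hashing objects.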
Also your Git binary, if compiled with only the One True Hash™, wouldn't be able to work with older repos at all because the hashes it's calculating are now different.
(Edit: Another benefit of generalizing this is that if/when, in the future, the new hash algorithm must be abandoned due to weaknesses, Git tooling will already have been introduced to the notion that hashes can differ, and the migration should hopefully be less involved the next time around.)
In my experience, generalizing ahead of need more often than not causes problems, and I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
SHA-2 and RIPEMD.
> And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
That's the problem: the software industry is still suffering from MD5 getting cracked [0]! Cryptographic agility is a baseline requirement for security primitives.
> In my experience, generalizing ahead of need more often than not causes problems
I agree and Linus has valid complaints about security recommendations during the 25-year history of Linux: most of the security recommendations kill performance and are only partial fixes, so why bother?
But Linus is also engaging in premature optimization: computers are ~30 billion times faster than when he first started programming Linux. Yes, SHA-2 is relatively slow, but they could have at least not hardcoded SHA-1 into the codebase and protocol.
> I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
You clearly haven't done any safety-related engineering. That's the thing about cryptography: millions of dollars and human lives are at stake. Despite the smartest people in the world working on these problems, cryptographic primitives and protocols are regularly broken. Due to quantum computing, every common cryptographic primitive we use today will need to be replaced or upgraded at some point.
Thankfully, you don't need to worry about the engineering of a given cryptographic primitive as long as you can swap it out with a new one. But when you hardcode a specific hash function and length into your protocol/codebase you are now assuming the role of a cryptographer.
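A tiny sketch of what "not hardcoding" looks like in practice: route every digest through a lookup keyed by algorithm name, so swapping the primitive becomes a table entry rather than a codebase-wide audit (the registry and names here are made up for illustration):

```python
import hashlib

# Hypothetical registry: algorithm name -> (constructor, digest size in bytes).
HASH_ALGOS = {
    "sha1": (hashlib.sha1, 20),
    "sha256": (hashlib.sha256, 32),
}

def object_id(data: bytes, algo: str = "sha1") -> str:
    ctor, _size = HASH_ALGOS[algo]
    return ctor(data).hexdigest()
```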
SHA-2.
> And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
Well, the real question is why someone picked SHA-1 over SHA-2 in 2005 when attacks that reduced its strength were already being demonstrated.
I still freshly recall the hoopla over BitKeeper licensing that led to Torvalds creating Git.
To derisively say "remind me why not X" at a diff that does X ... I am amused.
It is often hard to generalize when N=1. Now that the N=1 use case is established and we are moving towards N=2, it is painfully obvious to all that a better abstraction is needed.
Typedef or no, we would still need a full audit of the code to find spots where people "inlined" the expansion.
IMO, Linus should have done better here -- no crypto hash lasts forever, but this code is far cleaner than useless layers of abstraction.
(Hint: that's why GPG signing commits is an option.)
Some functions that previously operated on those char arrays have been changed to deal with the more generic struct instead.
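In Python terms, the change is roughly from passing bare 20-byte strings around to passing a small typed wrapper that also knows which algorithm produced it (a loose analogue of the C-side struct; the class name here is illustrative):

```python
from dataclasses import dataclass

# Wrapping the raw digest in a type makes "inlined" length assumptions
# visible to the compiler/type-checker instead of hiding in char arrays.
@dataclass(frozen=True)
class ObjectId:
    algo: str    # e.g. "sha1" or "sha256"
    raw: bytes   # raw digest bytes; length depends on algo

    def hex(self) -> str:
        return self.raw.hex()
```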
Technically, the lengths of MD5 (128 bits) and SHA-1 (160 bits) are sufficient for hashes, but the functions had cryptographic weaknesses: cryptanalytic attacks reduced the brute-force effort from the complete keyspace to something of a much smaller magnitude. These weaknesses are what led to the deprecation of MD5 and SHA-1.
It is definitely possible that new cryptanalytic attacks could be found against SHA-256/512, but none have so far been publicly demonstrated. Hence the confidence in them.
Not true. A 128-bit hash gets collisions after ~2^64 tries. A big cluster can find targeted 128-bit collisions. To attack something like git, the entire attack can be done offline.
The big MD5 X.509 break needed cryptanalysis to make it practical, because the attack had to happen in real time.
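The arithmetic behind that: an n-bit hash yields a ~50% chance of some collision after roughly 2^(n/2) attempts (the birthday bound), so 128 bits gives ~2^64 ≈ 1.8×10^19 hashes, within reach of a large offline cluster, while 160-bit SHA-1's generic bound of 2^80 is why breaking it required cryptanalysis rather than brute force:

```python
# Birthday bound: ~2**(n/2) hashes for a ~50% chance of some collision.
def birthday_bound(bits: int) -> int:
    return 2 ** (bits // 2)

md5_tries = birthday_bound(128)    # 2**64, feasible offline at scale
sha1_tries = birthday_bound(160)   # 2**80, generically out of reach
```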
https://git.kernel.org/pub/scm/git/git.git/commit/?id=5f7817...
So this change doesn't do much for now. Good to see, though.
The remaining instances of those values become constants or variables (which I'm also doing as part of the series), and it then becomes much easier to add a new hash function, since we've enumerated all the places we need to update (and can do so with a simple sed one-liner).
The biggest impediment to adding a new hash function has been dealing with the hard-coded constants everywhere.