I wish they had talked a little more about the tradeoff they made. They mentioned that splitting packfiles by fork was space-prohibitive, but they ended up with a solution that must take more space than what they started with. (If the new heuristic refuses to use some objects as delta bases, some options that would have provided the best compression are no longer available to git.)
The performance win is incredible, how much space did they give up in the process?
I didn't have any numbers on hand, so I just repacked our git/git network.git with and without the delta-aware code. A stock repack is about 300MB, and the delta-aware one is 400MB.
That sounds like a lot, but the real alternative is not sharing objects at all, which is more like 350GB.
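For anyone who wants to reproduce a comparison like this: the fork-aware delta code later shipped upstream as "delta islands" (git 2.20+). A rough sketch, assuming a network repository whose fork refs live under refs/remotes/&lt;fork&gt;/ (the layout and paths here are illustrative):

```shell
# Stock repack: deltas may cross fork boundaries (smaller on disk,
# but expensive when serving a single fork).
git -C network.git repack -a -d -f
du -sh network.git/objects/pack

# Delta-island repack: forbid delta bases outside each fork's "island".
# pack.island is a regex over ref names; the capture group means objects
# reachable only from refs/remotes/fork1/* never get delta'd against
# another fork's objects.
git -C network.git config pack.island 'refs/remotes/([^/]+)/'
git -C network.git repack -a -d -f -i    # -i is --delta-islands
du -sh network.git/objects/pack
```

The second `du` should come out larger, which is the space-for-serving-speed trade being discussed.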
If you have ever played the game of go, you know that when you first start out the board is empty. You have to place your stone somewhere, but actually it doesn't really matter where. Over time, as you place more and more stones, your choices become more and more important (and, ironically, more and more constrained). The weird thing is that even though it doesn't matter where you place those first stones, it becomes very important where they were placed as the situation unfolds.
Programming is similar. When you first start a project, it really doesn't matter what you do. Almost everything will work to one degree or another. But the original decisions gain more and more weight as the project becomes more and more complex (and you are faced with more and more compromises). Eventually those original decisions can make or break your project, even though it didn't matter at first what they were (of course, this is why refactoring is so important --- but that's a different discussion).
My point (finally) is that even though you may be making simple changes on simple systems, that doesn't have to stop you from understanding the implications your work would have should the project become larger. Take the opportunity to polish your skills and make your "opening game" as perfect as you can make it.
I agree that at some point every programmer must start working on complex systems in order to grow. If you are at that point and your employer does not offer complex problems, then maybe it is time to move. However, don't neglect your "opening game". It is very, very important because, as I said at the beginning, every large problem is a series of small problems.
> we're sending few objects, all from the tip of the repository, and these objects will usually be delta'ed against older objects that won't be sent. Therefore, Git tries to find new delta bases for these objects.
Why is this the case? git can send thin packs if the receiver already has the objects, so why does it still need to find a full base to diff against? (Not counting when the initial base objects come from another fork -- I don't know if that's often the case.)
On top of that, as far as I understood from the discussion about heuristics (https://git.kernel.org/cgit/git/git.git/tree/Documentation/t...), it seems like the latest objects are stored in full and the earlier objects are stored as deltas against them (double benefit: you usually want access to the latest object, which is already full, and earlier objects tend to only remove stuff, not add it, because "stuff grows over time"). So if objects are already stored as packs, things should be in pretty good shape to be sent as-is... or not?
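One way to see that heuristic for yourself is `git verify-pack -v`, which lists every object in a pack with its size, offset, and (for deltas) its chain depth and base object; in a freshly packed repo the newest version of a file is typically stored whole and older versions as deltas against it. A small sketch (the pack path is whatever your repo has):

```shell
# Columns: sha1, type, size, size-in-pack, offset,
# and for delta objects: depth and base sha1.
# The summary at the end counts "non delta" objects and chain lengths.
for idx in .git/objects/pack/*.idx; do
  git verify-pack -v "$idx"
done
```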
The problem, as I understand it, was that when you requested a full clone of fork1/repo.git, the server would find all objects reachable from refs in that repo, but by default git would have generated deltas for those objects against objects from other forks. When git noticed that those base objects were not going to be sent to the client, and that the client didn't know about them either, it recovered by doing expensive delta matching across the objects it was sending -- and without having traversed the graph beforehand, its heuristics weren't working properly, so this took forever.
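The cross-fork sharing at the root of this relies on git's alternates mechanism: a repository can list other object stores in objects/info/alternates and borrow objects from them without copying. A minimal sketch with hypothetical paths:

```shell
# A source repo with one commit.
git init -q src
git -C src commit -q --allow-empty -m base
oid=$(git -C src rev-parse HEAD)

# A bare "fork" with no objects of its own...
git init -q --bare fork.git

# ...can still read src's objects once src is listed as an alternate.
echo "$(pwd)/src/.git/objects" > fork.git/objects/info/alternates
git -C fork.git cat-file -t "$oid"   # prints "commit"
```

This is why a fork network can share one object store -- and also why pack generation has to care about objects the requesting client will never receive.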
There's no real reason for companies to educate external devs/hobbyists/students like this, but some do, and it's really awesome.
I've been putting up with 10-minute deploys due to precisely this issue of counting objects. It's slow because we don't use Local Git as our source-of-record repository (because commits initiate a deployment step), so every deploy involves a clean fetch into a new tmpdir.
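That deploy pattern looks roughly like this (the URL and deploy step are hypothetical); a shallow clone can at least shrink the number of objects the server has to count per deploy, at the cost of having no history in the checkout:

```shell
# Fresh fetch per deploy: nothing is reused between runs, so the server
# repeats the full object-counting work every time.
tmp=$(mktemp -d)
git clone --depth 1 https://example.com/org/app.git "$tmp/app"  # hypothetical URL
(cd "$tmp/app" && ./deploy.sh)   # hypothetical deploy step
rm -rf "$tmp"
```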
At least now I know why our deploys are getting slower and slower.
I notice that there is an inactive user account called engineering. If a User Page is ever created by that account, it would be available at engineering.github.io.
It's not unheard of for GitHub to rename inactive accounts, though: most likely, they gave this its own domain for something like SEO purposes (as it's content marketing).
Amazing work.
So does this mean one could attack a GitHub repository by having a lot of shill accounts clone it and add random objects (possibly causing a performance impact on the original)? I understand the engineering need for the use of alternates, but I wonder about the lowered degree of isolation.