I wish they had talked a little more about the tradeoff they made. They mentioned that splitting packfiles by fork was space-prohibitive, but they ended up with a solution that must take more space than what they started with. (If the new heuristic refuses to use some objects as delta bases, some options that would have provided the best compression are no longer available to git.)
The performance win is incredible, how much space did they give up in the process?
I didn't have any numbers on hand, so I just repacked our git/git network.git with and without the delta-aware code. A stock repack is about 300MB, and the delta-aware one is 400MB.
That sounds like a lot, but the real alternative is not sharing objects at all, which is more like 350GB.
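For anyone who wants to reproduce a comparison like this: the fork-aware delta code later shipped upstream as "delta islands" (git 2.20+). A rough sketch, assuming a network repository whose fork refs live under refs/remotes/&lt;fork&gt;/ (the layout and paths here are illustrative):

```shell
# Stock repack: deltas may cross fork boundaries (smaller on disk,
# but expensive when serving a single fork).
git -C network.git repack -a -d -f
du -sh network.git/objects/pack

# Delta-island repack: forbid delta bases outside each fork's "island".
# pack.island is a regex over ref names; the capture group means objects
# reachable only from refs/remotes/fork1/* never get delta'd against
# another fork's objects.
git -C network.git config pack.island 'refs/remotes/([^/]+)/'
git -C network.git repack -a -d -f -i    # -i is --delta-islands
du -sh network.git/objects/pack
```

The second `du` should come out larger, which is the space-for-serving-speed trade being discussed.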
If you have ever played the game of go, you know that when you first start out the board is empty. You have to place your stone somewhere, but actually it doesn't really matter where. Over time, as you place more and more stones, your choices become more and more important (and, ironically, more and more constrained). The weird thing is that even though it doesn't matter where you place those first stones, it becomes very important where they were placed as the situation unfolds.
Programming is similar. When you first start a project, it really doesn't matter what you do. Almost everything will work to one degree or another. But the original decisions gain more and more weight as the project becomes more and more complex (and you are faced with more and more compromises). Eventually those original decisions can make or break your project, even though it didn't matter at first what they were (of course, this is why refactoring is so important --- but that's a different discussion).
My point (finally) is that even though you may be making simple changes on simple systems, that doesn't have to stop you from understanding the implications your work would have should the project become larger. Take the opportunity to polish your skills and make your "opening game" as perfect as you can make it.
I agree that at some point every programmer must start working on complex systems in order to grow. If you are at that point and your employer does not offer complex problems, then maybe it is time to move. However, don't neglect your "opening game". It is very, very important because, as I said at the beginning, every large problem is a series of small problems.
> we're sending few objects, all from the tip of the repository, and these objects will usually be delta'ed against older objects that won't be sent. Therefore, Git tries to find new delta bases for these objects.
Why is this the case? git can send thin packs if the receiver already has the objects, so why does it still need to find a full base to diff against? (Not counting when the initial base objects come from another fork -- I don't know if that's often the case.)
On top of that, as far as I understood from the discussion about heuristics (https://git.kernel.org/cgit/git/git.git/tree/Documentation/t...), it seems like the latest objects are stored in full and the earlier objects are stored as deltas against them (double benefit: you usually want access to the latest object, which is already full, and earlier objects tend to only remove stuff, not add it, because "stuff grows over time"). So if objects are already stored as packs, things should be in pretty good shape to be sent as-is... or not?
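One way to see that heuristic for yourself is `git verify-pack -v`, which lists every object in a pack with its size, offset, and (for deltas) its chain depth and base object; in a freshly packed repo the newest version of a file is typically stored whole and older versions as deltas against it. A small sketch (the pack path is whatever your repo has):

```shell
# Columns: sha1, type, size, size-in-pack, offset,
# and for delta objects: depth and base sha1.
# The summary at the end counts "non delta" objects and chain lengths.
for idx in .git/objects/pack/*.idx; do
  git verify-pack -v "$idx"
done
```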
The problem, as I understand it, was that when you requested a full clone of fork1/repo.git, the server would find all objects reachable from refs in that repo, but by default git would have generated deltas for those objects against objects from other forks. When git noticed that those base objects were not going to be sent to the client, and that the client didn't know about them either, it recovered by doing expensive delta matching across the objects it was sending -- and without having traversed the graph beforehand, its heuristics weren't working properly, so this took forever.
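The cross-fork sharing at the root of this relies on git's alternates mechanism: a repository can list other object stores in objects/info/alternates and borrow objects from them without copying. A minimal sketch with hypothetical paths:

```shell
# A source repo with one commit.
git init -q src
git -C src commit -q --allow-empty -m base
oid=$(git -C src rev-parse HEAD)

# A bare "fork" with no objects of its own...
git init -q --bare fork.git

# ...can still read src's objects once src is listed as an alternate.
echo "$(pwd)/src/.git/objects" > fork.git/objects/info/alternates
git -C fork.git cat-file -t "$oid"   # prints "commit"
```

This is why a fork network can share one object store -- and also why pack generation has to care about objects the requesting client will never receive.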
There's no real reason for companies to educate external devs/hobbyists/students like this, but some do, and it's really awesome.
I've been putting up with 10-minute deploys due to precisely this issue of counting objects. It's slow because we don't use Local Git as our source-of-record repository (because commits initiate a deployment step), so every deploy involves a clean fetch into a new tmpdir.
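That deploy pattern looks roughly like this (the URL and deploy step are hypothetical); a shallow clone can at least shrink the number of objects the server has to count per deploy, at the cost of having no history in the checkout:

```shell
# Fresh fetch per deploy: nothing is reused between runs, so the server
# repeats the full object-counting work every time.
tmp=$(mktemp -d)
git clone --depth 1 https://example.com/org/app.git "$tmp/app"  # hypothetical URL
(cd "$tmp/app" && ./deploy.sh)   # hypothetical deploy step
rm -rf "$tmp"
```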
At least now I know why our deploys are getting slower and slower.
I notice that there is an inactive user account called engineering. If a User Page is ever created by that account, it would be available at engineering.github.io.
It's not unheard of for GitHub to rename inactive accounts, though: most likely, they gave this its own domain for something like SEO purposes (as it's content marketing).
Amazing work.
So does this mean one could attack a GitHub repository by having a lot of shill accounts clone it and add random objects (possibly causing a performance impact on the original)? I understand the engineering need for the use of alternates, but I wonder about the lowered degree of isolation.