We worked on a small project where we put together statistical measures for codebases. It was a lot of fun, even if the infrastructure was out of my wheelhouse at the time.
Folks that can manage billion-line codebases are on a whole different level I think. I wonder sometimes how many folks like that there are.
EDIT: Looks like he left for a bit and is now back. Good on him!
That said, there are some open source pieces to help. Facebook open sourced their Mercurial work, so you can get version control at scale (and before that you just used Perforce). Google open sourced Bazel, and also some parts of the infrastructure behind code search, though not enough for it to really work properly. And of course at a lower level there's a plethora of reasonable DB offerings, etc.
It would still require a lot of glue though.
It really is not a small difference.
Kidding aside, my point is Google recognizes obvious boundaries between e.g. their web stuff and android, and organizes their code accordingly.
I feel like these debates are often fueled by false arguments. Either way you go, you're going to want to build support tools and processes to tailor your VCS to your local needs.
First, VCS ACLs will massively reduce the benefits you're supposed to get from a monorepo. How will you do global refactors in that kind of situation? How does a maintainer of a library figure out how the clients are actually using it? (The clients must have visibility into the library, but the opposite is unlikely to be true.)
Second, let's say that I maintain a library with a supported public interface that's implemented in terms of an internal interface that nobody's supposed to use. How will VCS ACLs allow me to hide the implementation but not the interface? When clients kick off a build, the compiler needs to be able to read the implementation parts to actually build the library. The alternative is that clients can read the headers but then link against a pre-built binary blob. At that point you don't have a monorepo, you've got multirepos stored in a monorepo.
The actual solution is build system ACLs. Not ACLs for people, but ACLs for projects. Anyone can read the code, but you can say "only source files in directory X can include this header" or "only build files in directory Y can link against this object file".
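As a rough sketch of what a project-level ACL looks like in practice, here's a hypothetical Bazel BUILD file (the package paths and target names are made up; layering behavior depends on your toolchain configuration):

```starlark
# Internal implementation: only the hypothetical //projects/x tree
# may depend on this target. This is an ACL on projects, not people --
# anyone can still read the source.
cc_library(
    name = "impl",
    srcs = ["impl.cc"],
    hdrs = ["impl.h"],
    visibility = ["//projects/x:__subpackages__"],
)

# Public interface: anyone may depend on (and link against) this.
cc_library(
    name = "api",
    hdrs = ["api.h"],
    deps = [":impl"],
    visibility = ["//visibility:public"],
)
```

Attempting to add `deps = ["//lib:impl"]` from an unlisted package fails at analysis time, before anything compiles.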
> How will VCS ACLs allow me to hide the implementation but not the interface?
If you don't give people access to the code, they can't build it. So what? Publish pre-built binaries from your CI system back to source control.
> At that point you don't have a monorepo, you've got multirepos stored in a monorepo.
I think it's a spectrum. It would be stupid to dogmatically stick to either extreme. You modify things in a pragmatic fashion to solve the problems you're facing. In my experience, starting with a monorepo and making exceptions as needed has worked better than the alternative.
Your post sounds similar to a lot of the multi/mono repo discussions. You've focused on one problem and one way to solve that problem without considering that there are many ways to work around it. Neither approach is going to be pain-free and both require tooling for special scenarios.
I agree
> The actual solution is build system ACLs.
Or, maybe, better languages enforcing better design. In most cases artifacts and libraries are not related to the domain; engineers create them just to establish artificial boundaries between code components, isolate unrelated things, enforce encapsulation, and avoid accidental mixing of metalanguages.
It would be a lot better to have a smart compiler for this.
A tool which can prevent us from mixing different abstraction layers, creating unnecessary horizontal links between our components, etc.
I have a couple of ideas about what such a thing might look like.
It's a blog post, and the author didn't try to build a total and exhaustive formal system. These shortcomings aren't absolute truths, but in practice they do hold.
I've seen this multiple times: a small project evolves over years into a monster. Engineers add new components and reuse any other components they may need, creating horizontal links. At some point they feel like they've lost their productivity, and they blame the monorepo because it's easy to create horizontal links in a typical monorepo. So they try to build a multirepo flow, and they spend a lot of effort, time, and money trying to make it work. At some point they feel that their productivity is even worse than it was before, because now they need to orchestrate things, so they merge everything back.
Same applies not only to VCS flows, but to system design as well.
When we discuss monolith/microservices controversy all the monorepo/multirepo arguments may be isomorphically translated to that domain. What is better, monolithic app or a bunch of microservices? A role-based app of course: https://github.com/7mind/slides/blob/master/02-roles/target/...
Monorepo advocates are typically advocating for microservices, but within a single code base.
The way you provide access control is through code review and build system visibility.
In order to modify another group's code you require their approval on the review for that section of the code base. (Using mechanisms like github/gitlab owners files or rules within upsource.)
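The owners-file mechanism can be sketched as a hypothetical GitHub CODEOWNERS file at the repo root (the paths and team names here are made up):

```
# Changes under a team's directory require that team's review
# before merge -- access control via code review, not VCS ACLs.
/billing/   @acme/billing-team
/infra/     @acme/platform-team
```

With branch protection enabled, a PR touching `/billing/` can't merge until someone from `@acme/billing-team` approves it.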
This still means that if one group needs to make extensive changes to another group's code, the path of least resistance may be to fork it into your own group's section of the repo.
Build tools provide another point of control. If you're using a tool like bazel, the way you link to a component in another portion of the repo is through target names. The only targets your code will have access to are those that the owners have declared as being available for external builds.
Sure you could just use a manyrepo style of dependency tracking in a monorepo but I think that's not exactly what the author is exploring.
From what I read, that is a correct assessment. What the OP is proposing is something of a strawman. No advocate of monorepos I've ever met believes that a monorepo should imply a monolith.
Generally they're advocating monorepos in order to develop microservices faster, and with less effort. Using a monorepo and the associated tooling side steps the pain that comes from complicated CI, the difficulty of sharing code, the difficulties of non-atomic cross-repo reviews, and the difficulties of making multi-app refactorings.
>Surely source control is for source code?
This is just pedantry. Checking in binaries is a pragmatic solution that solves a lot of problems.
Git's design can limit its usefulness in this respect - though perhaps you could solve this to some extent with git LFS? - but not all version control systems have this problem.
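For the git LFS route, the setup is a few commands; a sketch (assumes git-lfs is installed, and the file patterns are just examples):

```shell
# Store large binaries as LFS pointers so clones stay small.
git lfs install
git lfs track "*.so" "*.jar"
git add .gitattributes
git commit -m "Track prebuilt binaries via LFS"
```

After that, matching files are stored as small pointer files in the repo, with the actual blobs fetched from the LFS server on demand.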
Even though it's one project.
Even though they refuse to allow a release of a single component - it must all be released together without forwards/backwards compatibility.
I think most of the time, the mono/multi debate is spoiled by people who feel they can have their cake and eat it too.
It works fairly nicely with meson, as you can simply checkout a worktree of a library into a subprojects directory, and let individual projects move at their own paces even if you don't do releases for the libraries/common code.
It's not really clear why having to update every consumer in sync with library changes is beneficial. Some consumers might have been just experiments, or one off projects, that don't have that much ongoing value to constantly port them to new versions of the common code. But you may still want to get back to them in the future, and want to be able to build them at any time.
It's just easier to manage all this with individual repos.
I think the majority of projects in this world only update everything at once. They haven't invested in the testing and sensible APIs that would allow updating small pieces of their solution.
From my experience, I also think the majority of people who think they have a library and need multi repos to deal with that, don't have a library.
To further clarify, one user of your library means you could stop pretending you have a library and avoid the pain.
I don't mean to insist these problems do not exist, I simply don't think many people have them.
I have typically left mobile iOS/Android in separate repos however - they have a different deployment cadence, so you need to manage breaking changes differently anyway.
I for one find it refreshing that people are willing to think about different workflows, even if they are unconventional.
It feels like what is described is a cross between a good language package manager and git submodules. It's an interesting space to explore, because a lot of nice things come out of submodules, but it's not a proper package manager.
A proper dependency manager that puts code in a workspace and manages it as you work on it, in a non-clunky way, is not something we have right now, and it could be a game changer. Thanks to the authors for sharing.
On the surface, most people seem to think of a monorepo as a source control management system that exposes all source code as if it's a traditional filesystem accessed through a single point of entry. Multirepo, in contrast, seems to be about multiple points of entry.
But that's a superficial and uninteresting distinction. All the hard parts of managing code remain for both and, for a sufficiently large organization, you'll still need multiple dedicated teams to build tooling to make either work at scale. All the pros listed in the article need a team to make them work for either approach, and all the cons are a sign that you need a team to make up for that deficiency for either approach.
Aesthetically a single point of entry appeals to me, in that it allows for a more consistent interface to code. But I'd go for good tooling above that in a heartbeat.
I built my engineering staff to focus on any of the initiatives that my boss hands to me (changes week/week) - so we went monorepo so we could move between those projects/apps/programs quickly.
We knew that we didn't want to pay the maintenance cost just because microservices/multirepo was a buzzword AND we wanted future ventures to get faster (example: we solved identity for authn/authz once and now every app that needs it after can leverage it and we can upgrade identity and all of its consumers in one pull request).
In a monorepo your builds are at the same point in time horizontally across all of your dependencies. You build together or not at all (though not necessarily at HEAD). In a multirepo you have the option to build against any (or some subset of) point-in-time snapshots of your dependencies on a dependency-by-dependency basis.
If you have a single monorepo that all of the code is in, but your build system allows you to specify what commit to build your dependency targets at instead of forcing you to use the same commit as your changes, you actually have a multirepo. If you have a bunch of repos but you build them all together in a CI/CD pipeline that builds each at its most recently released version, then you actually have a monorepo.
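The first case can be sketched with Bazel's git_repository rule, which pins a dependency to a specific commit regardless of what's at HEAD (the repo name, URL, and sha below are all hypothetical):

```starlark
# WORKSPACE sketch: build against a point-in-time snapshot of a
# dependency, giving multirepo semantics inside monorepo tooling.
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
    name = "common_lib",
    remote = "https://example.com/common.git",
    # Placeholder sha -- builds use this snapshot, not HEAD.
    commit = "0123456789abcdef0123456789abcdef01234567",
)
```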
I don't see it used very often though. Why not?
It was introduced to counterbalance what many saw as a big mess. The result was a lot of process being introduced, which slowed everything down, but that was probably necessary at that stage. To my knowledge the company keeps switching back and forth, but new projects that need to move fast are typically still done independently.
Can you have two metarepos, each with its own set of checked-out branches of the same original submodules?
So many questions, but they're all about identifying change and only deploying change...
Unless one enforces perfect one-to-one match between repo boundaries and deployments, this is also an issue with multirepos.
In practice, it's straightforward to write a short script that deploys a portion of a repo and have it trigger if its source subtree changes and then run it in your CI/CD environment.
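As one way to do this, here's a hypothetical GitHub Actions workflow that only runs when a service's subtree changes (the service path and deploy script are made up):

```yaml
# Deploy the payments service only when its subtree changes.
name: deploy-payments
on:
  push:
    branches: [main]
    paths:
      - "services/payments/**"
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./services/payments/deploy.sh  # hypothetical deploy script
```

Other CI systems have equivalent path-filter mechanisms; the point is that "deploy only what changed" is a configuration problem, not a repo-boundary problem.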
With respect to hiding, git has sparse checkouts that can give you a limited view of a repository (for performance reasons - not for security reasons)
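A sketch of that limited view, combining partial clone with cone-mode sparse checkout (requires a reasonably recent git, roughly 2.25+; the URL and paths are placeholders):

```shell
# Clone without blobs and with an initially sparse working tree,
# then materialize only the directories you actually work on.
git clone --filter=blob:none --sparse https://example.com/mono.git
cd mono
git sparse-checkout set services/payments libs/common
```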
But that's just today's git. Other VCSs like perforce provide much finer grained access control and hiding.
When last I was there things were finally beginning to burst at the seams. Platform architecture migrations were failing or being abandoned over too many untracked dependencies on specific versions of platform-provided libraries. (RHEL5, anyone?) Third-party had become a jungle of unmaintained libraries with dozens of versions that nobody ever upgraded, that may or may not have security vulnerabilities or known bugs, and many teams hadn't released new versions of their clients/libraries into Live for years in fear of breakage. The Builder Tools team was talking about giving up and abandoning both Brazil and Live as unsalvageable. Framework teams (Coral) were throwing their hands up in the air about how Coral-dependent services would not be able to upgrade to Java 11 without fixing a bunch of breaking changes that they would never agree to fix. The solutions being proposed to these problems by the Builder Tools team looked a lot like moving toward a monorepo, at least conceptually.
It was also a huge day-to-day quality of life improvement for the users (the developers.) There are UX problems with git, but they pale in comparison to the UX problems with perforce which is truly unpleasant software.
For CD, we have scripts that ask what service you want to build, and they specifically package that service using the set of files & processes dedicated to that service. The build generates a versioned artifact. After that, repo doesn't matter at all, we're just moving service artifacts around.
Not that you can’t still make your changes backwards compatible with themselves. But if I’m going to have to deploy everything in two steps anyway, what’s the point?
Some background: at my current place of employment I have 28 services, should be 30 in the next few days, so I think my current use case is very representative of a small to medium monorepo. At my last job right before this one we had sort of a monorepo that was strung together with git submodules, although each project was developed independently with its own git repo+CI.
> Isolation: monorepo does not prevent engineers from using the code they should not use.
Your version control software does not prevent or allow your developers from using code they should not use. It is trivial to check in code that does something like this:

    import "~/company/other-repo/source-file.lang" as not_mine;

Or, even worse, in something like golang:

    import "github.com/company/internal-tool/..."
Because of this, it is my opinion that it is impossible to rely solely on your source control to hide internal packages/source/deps from external consumers. That responsibility, of preventing touching deps, has to be pushed upwards in the stack, either to developers or tooling.
> So, big projects in a monorepo have a tendency to degrade and become unmaintainable over time. It’s possible to enforce a strict code review and artifact layouting preventing such degradation but it’s not easy and it’s time consuming,
I think my above example demonstrates this is something that is not unique to monorepos. The level of abstraction that VCS' operate at is not ideal for code-level dependency concepts.
> Build time
Most build systems support caching. Some even do it transparently. Docker's implementation of build caching has, in my experience, been lovely to work with.
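Docker's layer caching rewards ordering your Dockerfile so the expensive, rarely-changing steps come first; a common sketch (the file names are just the usual Python convention):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Copy only the dependency manifest first, so this layer -- and the
# expensive install below -- stays cached until requirements.txt changes.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Source edits only invalidate the layers from this point on.
COPY . .
CMD ["python", "main.py"]
```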
---- Multi repo section ----
> In case your release flow involves several components - it’s always a real pain.
This is doubly or triply true for monorepos, because the barrier to cross-service refactors is so low. Due to a lack of good rollout tooling, most people with monorepos release everything together. I know my CI essentially does `kubectl apply -f`. Unfortunately, due to the nature of distributed compute, you have no guarantee that new versions of your application won't be seen by old versions (especially so with zero-downtime deployments like blue-green/red-black/canary). Because of this you constantly need to be vigilant about backwards compatibility. Version N of your internal protocol must be N-1 compliant to support zero-downtime deployments. This is something that new members of a monorepo team have huge difficulty working with.
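A minimal sketch of what N/N-1 compliance means for a consumer: a field added in version N must be treated as optional, because N-1 pods still produce payloads without it. (The event shape and the "currency" field are hypothetical.)

```python
import json

def parse_payment(raw: str) -> dict:
    """Parse a payment event from any live protocol version."""
    event = json.loads(raw)
    return {
        "amount": event["amount"],
        # "currency" was added in version N; N-1 producers omit it,
        # so default rather than assume the field exists.
        "currency": event.get("currency", "USD"),
    }

old = parse_payment('{"amount": 100}')                     # from an N-1 pod
new = parse_payment('{"amount": 100, "currency": "EUR"}')  # from an N pod
print(old["currency"], new["currency"])  # USD EUR
```

Only once every producer is on version N can the default be dropped and the field made required, in a later release.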
> It allows people to quickly build independent components,
To start building a new component all one must do is `mkdir projects/<product area>/<project name>`. This is far lower overhead than most multi-repo setups. You can even `rm -r projects/<product area>/<thing you are replacing>` to completely kill off legacy components so they don't distract you while you work. The rollout of the new tool went poorly? Just revert to the commit beforehand and redeploy, and your old project's directories, configs, etc. are all in the repo. With separate git repos you're left with unversioned cross-repo state that inherently can never be removed if you want a source tree that is green and deployable at any commit hash.
--- Their solution ---
I accomplish the same tasks with a directory structure. As mentioned before, if you just put your code into a `projects/<product area>/<project>` structure, you can get the same effect they are going for by minimizing the directory layout in your IDE's file view. The performance hit from having the entire code base checked out is very much a non-issue for >99% of us. Very, very few of us have code bases larger than the Linux mainline, and git works fine for their use cases.
Also, any monorepo build tool like Bazel, Buck, Pants, and Please.build will perform adequately for the most common repo sizes and will provide you hermetic, cached, and correct builds. These tools also already exist and have a community around them.