It's true that monorepos without proper discipline can tend toward coupling, and we should keep that in mind when discussing mono vs. poly.
I don't know how you maintain that arm's length separation if you don't have compilation units in your language of choice, and that may contribute to some of the muddiness in this kind of discussion. "It depends."
- [x] Namespaces and the like without much security benefit
- [x] Giant Java dependency
- [x] Strange syntax and glyphs

I feel like if you are working completely in the open-source world, contributing one open-source project to a larger array of available projects, then the decision to use a polyrepo makes a lot of sense. You can publish libraries to a package registry like npm or PyPI, or you can use Git references as with Go's package manager.
But what I experienced with polyrepos outside this world is that we ended up with a weird DAG of repos. It was always unclear whether a piece of code duplicated between projects should be moved into one dependency or another, or get its own repo. Transitive dependencies were no fun at all; with git submodules you might end up with two copies of the same dependency. You might have to make a sequence of commits to different repos, remembering to update cross-repo references as you went, and if you got stuck somewhere you had to unwind it all. It felt like a regression, like going from CVS back to RCS.
Again, in the open-source world you might have some of this taken care of by using a package manager like Yarn. But if your transitive dependencies aren't suitable for being published that way, it can be tough. Monorepo + Bazel right now is a bit rough around the edges but overall it's reduced the amount of engineering time spent on build systems.
On the other hand, it's not like Bazel can't handle polyrepos. In fact, they work quite nicely, and Bazel can automatically do partial checkouts of sets of related polyrepos, if that's your thing.
As for VCS scalability problems, I expect that Git is really just the popular VCS du jour, and some white knight will show up any day now with a good story for large, centralized repos with a VFS layer. In the meantime, any company large enough to experience VCS performance problems but not large enough to have its own VCS team (as Google and Facebook do) will suffer, or possibly pay for Perforce.
If your project is mostly something like C++ (which has support built-in to Bazel) then the WORKSPACE rules will be much more manageable and partial checkouts become a lot easier.
I'd be more interested to read about a project or company that failed due to making one choice or the other. And then by switching things to the other way, things were fixed.
Otherwise, as someone who has worked with both, I imagine there are a host of other decisions that will be far more determinative of your success.
Let's not get too wrapped up in what color to paint the shed.
Please don't do this.
Of course the Monorepo is not free of downsides, those mentioned in the original article are real, although a bit exaggerated in my opinion. VCS operations can be slow and scaling a VCS system is challenging, but possible. And the risk of high coupling and a tangled architecture is also very real if you don't use a dependency management system like Bazel/Buck/Pants.
But in my opinion the downsides of the polyrepo are much worse and much, much harder to fix. The main problem is that you need a parallel versioning scheme like SemVer on top of your VCS. SemVer is fine for open-source projects, but for a dynamic organization it is a nightmare, because it is a manual process prone to failure. SemVer dependency hell is really hard to deal with and creates a lot of technical debt.
Additionally, once you go polyrepo you lose true CI/CD. Yes, you still have CI/CD pipelines, but they apply only to a fraction of the code. Once you get used to running `bazel test`, knowing you will run every single test of any piece of code that could depend on what you just changed, you never want to go back. Yes, you could have true CI/CD with polyrepos, but it requires a lot of work and a lot of tooling that does not exist in the wild. It is cheaper to invest in scaling your VCS than in building that multi-repo tooling.
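For readers unfamiliar with how `bazel test` knows what "could depend on" a change: every dependency edge is declared in BUILD files, so the reverse-dependency closure is queryable. A sketch with made-up package and target names:

```starlark
# base/BUILD (hypothetical package)
cc_library(
    name = "strings",
    srcs = ["strings.cc"],
    hdrs = ["strings.h"],
)

# payments/BUILD (hypothetical package)
cc_test(
    name = "payments_test",
    srcs = ["payments_test.cc"],
    deps = ["//base:strings"],  # the edge Bazel follows when selecting tests
)
```

With edges like these, `bazel test //...` covers everything, and a query such as `bazel query "rdeps(//..., //base:strings)"` lists exactly the targets a change to `//base:strings` can affect.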
If we had the tooling to do multirepo atomic commits and reviews then maybe we would have stuck with polyrepos, but it doesn't really exist out in the wild, so monorepo it was.
Maybe you can clear my confusion. If Module B is dependent on Module A, then every version of B should refer to a specific version of A, correct? What is there to break? Development can continue on A without interfering with B, and then you can uptick B once it points to a later A.
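The model described above, sketched as an exact (unranged) pin in a hypothetical `package.json` for module B, looks like:

```json
{
  "name": "module-b",
  "version": "2.3.0",
  "dependencies": {
    "module-a": "1.4.2"
  }
}
```

B keeps building against module-a 1.4.2 no matter what lands on A's main branch; bumping the pin is an explicit commit in B. The pain the thread describes starts when many modules pin different versions of A and a fix must propagate through all of them.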
I'm not sure what this has to do with the mono/poly discussion.
To avoid that, you do 10 migration commits so everyone is on the latest version. If you're going to do that as standard operating procedure anyway, you might as well make it far easier and have a monorepo.
Adding additional PRs across different repos is functionally no different from one PR with scattered changes in a monorepo, except that separating the PRs makes each isolated set of changes more atomic and focused. That has led to fewer bugs and better-quality code review. And the biggest win: each repo is free to use whatever CI and deployment tooling it needs, with no constraints from whatever CI or deployment tool another chunk of code in some other repo uses.
The last point is not trivial. Lots of people glibly assume you can create monorepo setups where arbitrary new projects inside the monorepo are free to use whatever resource provisioning strategy or language or tooling they want, but in reality this is not true: there is implicit pressure to rely on the existing tooling (even if it's not right for the job), and monorepos beget monopolicies, where experimentation that violates some monorepo decision can be wholly prevented by political blockers in the name of the monorepo.
One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.
The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.
We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).
But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests.
Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
In my experience, this type of policy blocker is uniquely common to monorepos, and easily avoided in polyrepo situations. It’s just a whole class of problem that rarely applies in a polyrepo setting, but almost always causes huge issues with monorepo policies and fixed tooling choices that end up being a poor fit for necessary experiments or innovative projects that happen later.
Hear, hear. Let teams choose the processes and tools that work best for them. In previous release engineering positions, I resisted the many attempts to introduce a single standard workflow for all projects. The support burden of letting a thousand flowers bloom was not great, but the benefit was that devs understood their project and were empowered to make changes when business requirements changed faster than standardized tooling could.
We had a few contracts for standard behaviours, but they were low-overhead: must respond to 'make/make test', have a /status endpoint that 500'd when it was unhealthy, register a port in the service conf repo, etc.
It makes it less atomic if you need simultaneous changes in multiple repositories.
> Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
I think this was an organisational problem, but I accept the argument that monorepos will provide a seed around which such pathologies can crystallise. But I don't believe it's the only such seed and I don't think it's an inevitable outcome from monorepos.
Unless you mean your presubmit test would push to production machines, in which case that's bad and shouldn't be allowed, but again that has nothing to do with a monorepo.
A company could just as easily have draconian policies about testing and deployment with multiple repos. Maybe you could break the rules (hell, you could have broken the rules in monorepo land too), but again, that's a rules issue, not a repository issue.
Both monorepos and polyrepos have advantages and disadvantages. Many factors — scale, overall team quality and experience, level of integration between projects are a few that come to mind — will affect how much those advantages and disadvantages matter to any given company at any given point in time. The right choice for you isn't necessarily the right choice for me.
Much more important than which approach you choose is understanding, and accepting, the consequences of your choice. You'll want to extract value out of the advantages, you'll need to mitigate the disadvantages. You won't be able to adopt tools and processes meant for the other approach without some degree of friction.
All this forcing people to do things the Right Way (my way) is surely part of the pushback against monorepos.
But set that aside for the moment. Let's suppose defaults should force people to do things the Right Way, and that we also know what the Right Way is.
Instead of letting anyone sloppily depend on any code checked into the monorepo, shouldn't we force people to think long and hard about contracts between components -- the default concern in a polyrepo architecture? When and how to make contracts, when and how to break contracts? Isn't this how Amazon moved past their monorepo woes, adopted SOA, built AWS, and became one of the largest companies on earth? Heck, isn't this how the Internet itself was built?
It's not that there's a single right way to do it. There isn't one, and anyone who tells you there is either has something to sell you or is too inexperienced to have seen enough of the problem domain.
What is for certain: teams need to have tooling that causes the conversations and behavior that lead to the outcomes we want. As systems and teams scale large enough, this tooling becomes essential - without it, teams go their own way, and in so doing, may or may not create the culture needed for the outcomes you want.
I have never once in my career, so far, had to tell a team to communicate less. When we're talking about engineering organizations that are large enough to diverge, you must solve these problems somehow, and it needs to be systemic and intentional.
Your post puts a lot of the onus on A for breaking B, C, and D, but I think equal care and consideration needs to come from the other side of the contract. E.g., what are you depending on? Is it a dependency you want to take on, or are you and the shared code likely to diverge in life? These are top-of-mind decisions in a polyrepo architecture, but from my experience they're often not even considered in a monorepo. Anything checked in is fair game for reuse. This is why I suspect you may be "forcing" the wrong thing.
For reference I've worked in companies large and small, both monorepo and polyrepo. When I worked on Windows back in the 00's the monorepo tooling (SourceDepot) was quite amazing for the time, but the costs of that sort of coordination were also painfully apparent to everyone.
The place I currently work has a monorepo for desktop software and polyrepos for everything else. It isn't a straight-up A/B experiment, but anecdotally the pain is higher and shipping velocity lower in the monorepo half of the world. Most of the monorepo pain is related to CI or other costs of global coordination, the kind of things Matt touches on midway (albeit probably too subtly). I'd be interested to see your counterarguments to those points as well. Do you need fancy dependency management tooling to make your global CI builds fast and reproducible? Matt argues those end up being equivalent to the kind of dependency tooling that's intrinsic to polyrepo architectures anyway.
Fighting back against monorepo design is dangerous - embrace experimentation.
What's dangerous about it? Monorepos have a lot of benefits and should absolutely be considered, maybe even by most. But right now the community almost pushes them as the "only true way, with all benefits and no drawbacks," and that's absolutely not true. To the point that the knowledge of why and how to run polyrepos is already starting to get lost.
That's dangerous.
What do you even mean by "dangerous"? To a business? To your health?
What is the deal with people trying to make these sorts of global assertions in a vacuum about what's "good" and "bad"? This doesn't make any engineering sense in any way to me. You have a problem and you figure out the best way for your business to solve that problem given some bounded resources. Nothing in the basic problem solving process (scientific method?) necessitates all the arbitrary "should" axioms. Why don't people just analyze their specific situation and figure out a solution?
It's like people arguing vehemently about the optimal design that every company "should" be using for all windshields for all personal vehicles on the road, without even remotely discussing various vehicle body shapes and sizes.
Have you (or anyone reading this thread) encountered similar issues? How do you solve them in a monorepo?
My feelings here are apart from your tool of choice (Pypi) so read them with that in mind.
Why are you dependent on third-party code that isn't in your repo? I am a huge advocate of the monorepo and of vendoring. Depending on your tooling of choice and your workflow, checks for updates on this third-party code should be frequent (security) and done by someone qualified (not a job for the "new guy").
The question is where should this start and end? The answer (for me) is everything, and I have elected to use less (and reduce complexity) to avoid bloat. Really, though, this is an artifact of my use of Git: https://unix.stackexchange.com/questions/233327/is-it-possib...
Git has had a sparse-checkout feature for a long time, but it only affected the checkout itself; all the blobs would still be synced.
Now Git is gaining good monorepo capabilities with the partial clone feature [1]. The idea is that you clone only the parts of a repository that are interesting to you. This has been brewing for a while, but I'm not sure how ready it is. There doesn't seem to be user-level documentation for it yet, to my knowledge, so I am linking to the technical docs.
[1]: https://github.com/git/git/blob/master/Documentation/technic...
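As a concrete sketch of the sparse-checkout side (git >= 2.25 assumed; directory names here are hypothetical stand-ins for projects inside a big monorepo):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in "monorepo" with two top-level projects.
git init -q origin-repo
cd origin-repo
mkdir -p services/payments services/search
echo pay > services/payments/main.txt
echo find > services/search/main.txt
git add .
git -c user.email=a@b -c user.name=t commit -qm init
cd ..

# Clone it, then narrow the working tree to a single project.
git clone -q origin-repo work
cd work
git sparse-checkout init --cone
git sparse-checkout set services/payments

ls services    # only 'payments' is materialized now
```

Combined with `git clone --filter=blob:none` against a server that supports partial clone, this keeps both the working tree and the fetched blobs limited to the slice you care about.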
You can certainly achieve this with Perforce, SVN, HG, any repo system there too.
Linux: FUSE + ?
Windows: Dokan? CBFS? Or the newfangled https://docs.microsoft.com/en-us/windows/desktop/projfs/proj... which VFSForGit uses
Let me give a concrete example. The Android Open Source Project (AOSP), which builds the system for Android devices, has code close to tens of GB in size (let alone all the history!). It is already a massive monorepo in itself. And typically you would have many of them, from different OEM/SoC vendors and different major releases. In such a scenario, it would turn into 'a monorepo of monorepos,' which is quite unpleasant to imagine.
With 100 engineers a monorepo might seem a good idea. With 500 it becomes nearly impossible to do anything involving a build. Some isolation is needed.
Also, from my experience, many engineers just don't give a shit about architecture. They create an entangled mess that kind of works for the customer, and go home. Without some enforced isolation it is impossible to maintain.
That being said I am more inclined to polyrepos.
Today, not quite. I work for a multi billion dollar tech company and we have several thousand repos (and it's awesome)
Both FB and Google have more than 500 devs and are using a monorepo.
This would help people working on smaller apps, since they don't need to look at other apps unless they're working on shared library code.
Of course, once you are working on library code, you have to build and test all the apps that use it. But even at Google, the people working on the lowest levels of the system can't use the standard tools anyway.
I don't see why you'd need semver. The apps could sync to a particular commit in the library repo.
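That "sync to a particular commit" model can be sketched with a plain git submodule; everything below runs against local stand-in repos (the `protocol.file.allow` override is needed on newer git to clone submodules from local paths):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
G="git -c user.email=a@b -c user.name=t -c protocol.file.allow=always"

# A library repo with two commits.
$G init -q lib
(cd lib && echo v1 > lib.txt && $G add . && $G commit -qm one)
pin=$(git -C lib rev-parse HEAD)                 # the commit the app pins
(cd lib && echo v2 > lib.txt && $G commit -qam two)   # library moves on

# The app vendors the library at that exact commit.
$G init -q app
cd app
$G submodule add -q "$tmp/lib" vendor/lib
git -C vendor/lib checkout -q "$pin"             # pin to the older commit
$G add vendor/lib
$G commit -qm "pin lib at known-good commit"

cat vendor/lib/lib.txt    # still v1: later library work doesn't affect the app
```

No semver needed: the app records a commit hash, and "upticking" is an explicit commit that moves the pin forward.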
More to the point, as the author of TFA allows, once a system reaches a certain size, nobody can understand it all. At some point you have to engage division of labor /specialization, and once you do that, it doesn't make sense to have just anybody randomly making changes in parts of the code-base they don't normally work in.
I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR. Basically "internal open source" or "inner source"[1].
In my experience, this is about as close as you can get to a "best of both worlds" situation. But, as the author of TFA also says, you absolutely can make either approach work.
It was a gigantic pain trying to find owners for half-dead repos for services still running and in use, where the original authors had left years ago & from teams 4 or 5 restructures ago. The one thing I learned was: never make a user the owner of a repo (unless it is in their personal space), always find a team to accept responsibility for it.
This is how it works at my company. The issue we run into is that PRs coming from non-core maintainers tend to either get over-scrutinized (e.g. "this diff may work for you but it's not generic enough for X/Y/Z") or flat out ignored at the code review stage and sometimes don't land in a timely enough manner.
Another challenge with this approach is when you have deeply nested dependencies and need to "propagate" an upgrade in some deep dep up the tree. In the JS/Node world, fixing an issue usually means hacking on transpiled files in a project's node_modules folder to figure out what change needs to be made, then mirroring that change into the actual repo and tweaking things until type checking/linting/CI pass. Not really conducive to collaboration.
One other problem is that security/bug-fix rollouts are a bit more challenging. We had a case a while back where a crash-inducing bug was fixed and published, but people still experienced crashes because they hadn't upgraded the one out of dozens of packages required by their projects.
Here's my rule: You break it, you fix it.
> I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR.
I'd rather see pairing, extensive tests and fast CI. I see PRs as a necessary evil, rather than a good thing in themselves. If I make a change that breaks other teams, I should fix it. If I can make a change to fix code anywhere in the codebase, I should write the test, write the fix and submit it.
Small, frequent commits with extensive testing creates a virtuous cycle. You pull frequently because there are small commits. You are less likely to get out of sync because of frequent pulls. You make small commits frequently because you want to avoid conflicts. Everyone moves a lot faster. I have had this exact experience and it is frankly glorious.
I’ve seen this invoked so many times to shirk responsibility though. Someone piles up all kinds of crap in a tight little closet, complete with a bowling ball on top, and the next unsuspecting dev who comes by and opens it gets an avalanche of crap falling on them while the original author can be heard somewhere in the background saying “it’s not my problem.”
This winds up leading to more crap-stacking just to get the work done ASAP and you wind up with a mountain of tech debt.
I like the zero flaw principle where new feature work stops until all currently known flaws are fixed. Then everyone is forced to pitch in and responsibility is shared whether you want it or not.
Unfortunately, some of the most popular CI/CD services out there (Travis, Circle, etc.) don't even support cross-repo pipelines, much less monorepo builds.
Those both look way more in the weeds than what I would have imagined. I guess for Bazel at least it makes sense, given Google's scale, how fine-grained they get with caching and incremental builds.
For my needs, a simple tool that discovers "workspaces" and constructs a build graph based on what's changed, while handing off the actual building to some entry point in each workspace, would be good enough. We have a weird collection of Gradle projects, node projects, test suites, docs, etc., each with its own build process already in place.
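The core of such a tool can be sketched with git alone; the one-top-level-directory-per-workspace layout here is an assumption for illustration:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
G="git -c user.email=a@b -c user.name=t"
$G init -q .
mkdir -p api/src web/src docs
echo a > api/src/Main.java
echo w > web/src/app.ts
echo d > docs/index.md
$G add . && $G commit -qm base

echo a2 > api/src/Main.java       # touch only the 'api' workspace
$G commit -qam "api change"

# Workspaces affected by the last commit = unique top-level dirs touched.
changed=$(git diff --name-only HEAD~1 HEAD | cut -d/ -f1 | sort -u)
echo "$changed"    # prints: api
```

Each entry in `$changed` would then be handed off to that workspace's own build entry point (Gradle, npm, whatever is already in place).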
Some things are also on a "critical" path while others can run async given the context (branch, tag, etc.).
I'm rambling though.
I find it amusing how plenty of comments, both here and in the other discussion, say "We had a mono/polyrepo and things improved tremendously when we migrated to a poly/monorepo." The issue might be one of growth and complacency: a drastic change like that forces the team to face the technical debt that was being ignored and do a better implementation using what was learned from past mistakes.
Perhaps the fact that since their level was now higher, they wouldn't have to deal with the nitty gritty details and pain of working with a monorepo as a developer?
E.g. I wasn't for it when I was a dev, but now that I can just impose it on others, I love it. Same with how various 'development process' rituals are adopted...
How does the library team know which consumers a commit may break? What tools are recommended?
With a monorepo, the basic effort you have to put in to start scaling is quite high. To properly do a local build, you need Bazel or something like it. But Bazel doesn't stop at just building; it manages dependencies all the way down to individual libraries. Say you're using certain Maven plugins for code coverage, shading, etc. Would Bazel have all the build plugins your project needs? Most likely not. You have to port a bunch of plugins from Maven to Bazel, and so on. Guess how many IDEs support Bazel? Not a lot.
Then you need to run a different kind of build farm. When you check in to a monorepo, you need to split and distribute one single build. Compared to a polyrepo, where one build == one job, a monorepo is one build == a distributed pool of jobs, which again needs very deep integration with the build tool (Bazel again) to fan out and fan in across multiple machines, aggregate artifacts, and so on.
Then the deployment. Same again. There is no "just works" hosted CI or hosted git or anything for monorepos. People still dabble with concourse or so on.
And guess what, for a component in its own repo, you don't need to do anything. Existing industry and OSS tooling is built from ground up for that. Just go and use them.
To provide a developer a basic experience for working on, building, and deploying a single component, the upfront investment a monorepo demands is very high. Most companies cannot spend time on that, because scale means different things to different companies. There is a vast gap between the ops/dev tooling available for independent hosted components and monorepo tools; just search for "monorepo tools" or DAG and see how many you can come up with.

So what really happens with a monorepo is that most companies go with multi-module Maven and Jenkins multi-jobs. The results are easy to predict. I'm not saying Maven/Jenkins are bad, but they are _not_ sophisticated, and they are nowhere close to what Twitter/Facebook/Google or any modern company uses to deal with a monorepo (for good reason). They are just not good at DAGs. If you're relying on Maven+Jenkins as your monorepo solution, all I can say is "good luck".
Instead, if you start by putting one component in one repo, you keep scaling for _much longer_ before you hit a barrier.
In principle, monorepos are better. In practice, they don't have the basic "table stakes" tooling that you need to get going. Maybe monorepo devops tooling is a next developer productivity startup space. But until then, it's not mainstream for very good reasons.
How do the "global build tools" play with language specific build tools?
My primary stack is Rust and Scala. Both have excellent build capabilities in their native tools. How well do pants/bazel integrate with them? I wouldn't want to rewrite complex builds nor would I expect these tools to have 100% functionality of native ones.
I know the Scala rules are used in production by multiple companies. Rust support is improving quickly, but it's not perfect. See the dedicated GitHub repositories for more information.
(I work on Bazel)
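For anyone curious what the Scala integration looks like in practice, a minimal target under rules_scala is roughly this (load path per the rules_scala repository; target names are made up):

```starlark
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library")

scala_library(
    name = "core",
    srcs = glob(["src/main/scala/**/*.scala"]),
    deps = ["//libs/util"],  # hypothetical in-repo dependency
)
```

Complex sbt builds don't translate one-to-one; check the rules_scala repository for what is and isn't supported before committing to a migration.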
I'd say that open-source best practices for shared libraries are appropriate if you're making an open-source shared library. However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases. In my experience, it's also far from "problem solved". You can point your finger at semantic versioning but in the meantime we go through hell and back with package managers trying to manage transitive library dependencies and it SUCKS. Why, for example, do you think people are fed up with NPM and created Yarn? Or why people constantly complain about Pip / Pipenv and the like? Why was the module system in Go 1.11 such a big deal? The answer is that it's hard to follow best practices for shared libraries, and even when you do follow best practices, you end up with mistakes or problems. These take engineering effort to solve. One of the solutions available is to use a monorepo, which doesn't magically solve all of your problems, it just solves certain problems while creating new problems. You have to weigh the pros and cons of the approaches.
In my experience, the many problems with polyrepos are mostly replaced with the relatively minor problems of VCS scalability and a poor branching story (mostly for long-running branches).
Why do you say so?
I agree shared-library style makes more sense in most cases, though. The main problem with it is forcing everyone onto the latest library versions, but that isn't insurmountable by any means.
Personally, I think there's a place for monorepos and a place for smaller independent repos. If a project is independent and decoupled from the rest of the tightly coupled code base (for instance, things that get open-sourced), it makes no sense to shove it into a huge monorepo.