In a previous life, before microservices, CI/CD etc. existed, we did just fine with 20-30 CVS repositories, each representing a separate component (a running process) in a very large distributed system.
The only difference was that we did not have to marshal a large number of 3rd-party dependencies that were constantly undergoing version changes. We basically relied on C++, the Standard Template Library, and a tightly version-controlled set of internal libraries with a single stable version shared across the entire org. The whole system would have been between 750,000 and 1,000,000 lines of code (libraries included).
I'm not saying that that's the right approach. But it's mind boggling for me that we can't solve this problem easily anymore.
- Contract-first API development
- All API contract definition files (OpenAPI/Swagger, .proto, .wsdl...) in a single repo, which has a CI/CD pipeline to bundle them into artifacts for various platforms (Maven, NuGet, npm, gem...)
- Consumers and producers import the "api-contracts" dependency; this is the only coupling between components
- Consumers and producers both generate necessary code (server stubs, client libraries) at build time
IMHO, if your service clients have dependencies on implementations of APIs rather than just the definitions, you're not realizing the key benefit of microservices (or SOA).
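To make the "api-contracts" idea concrete, a contract in that repo might be a plain .proto file that both producers and consumers generate code from at build time. This is a hypothetical illustration; the package and service names are invented:

```proto
// Hypothetical contract file living in the shared "api-contracts" repo.
// Producers generate server stubs from it; consumers generate clients.
syntax = "proto3";
package orders.v1;

message GetOrderRequest { string order_id = 1; }
message Order { string order_id = 1; string status = 2; }

service OrderService {
  rpc GetOrder(GetOrderRequest) returns (Order);
}
```

The only artifact either side depends on is the generated code from this definition, keeping implementations fully decoupled.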
The article actually complains less about mono-repos and more about mono-repos on Git and the associated tooling around Git.
The article, however, cites dependency management as the main complaint, to the point that it's mentioned immediately after the first place monorepos come up.
It would be nice if there was a tool that could help you identify just how much of each dependency you actually depend on so you could trim it.
Also, Go vendoring tools usually trim the repos down to just the packages you import.
It does not have to be complicated.
The questions around "which repositories do I need?" and "how do I update all of them?" and "how do I make an atomic transaction [commit, branch, PR] across all of them?" are interesting questions in a multi-repo situation, but there are plenty of possible answers as well.
Some of them are just social in nature (read the README, watch/follow the whole GitHub organization, etc), so they aren't as interesting technically as monorepo or "meta-repo" tools.
I think the tooling around this is fairly limited right now. I feel that most people are hoping Docker caches stuff intelligently, which it doesn't. People should probably be using Bazel, but language support is hit-or-miss and it's very complicated. (This is aggravated by the fact that every language now considers itself responsible for building its own code. Go "just works", which is great, but it's hard to translate that local caching to something that can be spread among multiple build workers. Bazel attempts to make all that work, but it basically has to start from scratch, which is unfortunate. It also means that you can't just start using some crazy new language unless you want to support it in the build system too. We all hate Makefiles, but the whole "foo.c becomes foo.o" model was much more straightforward than what languages do today.)
I contribute a lot to Nixpkgs, which is a monorepo with almost 50000 subcomponents [1], but because the build tool and CI track changes through hashes, changing a package only triggers rebuilds of other packages that depend on it and builds are super quick. It accomplishes this by heavily caching previous builds and sharing those between all builders.
No, monorepos are not going to work with a CI and build tool that always builds everything from scratch and does no caching. Instead, you should pick the right tool for the job, and go with a build system like Nix, Buck, Bazel or Please which were designed with monorepos in mind.
I think the second point the author makes, but only very briefly, is way more important to look at. Is git itself up to the job for such large repositories? One problem I've started running into in nixpkgs is that `git blame` takes considerable time to even execute, due to the enormous volume of commits in the repository. I would love to see a version control system that is optimised for storing lots of loosely connected components, and has better support for partial checkouts. I haven't found it yet, and I would love to hear what others are using for this.
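For what it's worth, recent git (roughly 2.25+, 2.28+ for `init -b`) has grown sparse checkout, and partial clone covers the fetch side; a minimal local sketch, faking a small "monorepo" in a temp dir since the repo names here are invented:

```shell
set -e
# Fake a tiny "monorepo" locally; against a real remote you would also pass
# --filter=blob:none to the clone so blobs you never check out aren't fetched.
tmp=$(mktemp -d)
git init -qb main "$tmp/mono" && cd "$tmp/mono"
mkdir -p pkgs/a pkgs/b
echo a > pkgs/a/default.nix
echo b > pkgs/b/default.nix
git add . && git -c user.email=x@y -c user.name=x commit -qm init
cd "$tmp"
git clone -q --no-checkout mono partial && cd partial
git sparse-checkout init --cone
git sparse-checkout set pkgs/a
git checkout -q main    # worktree now contains pkgs/a but not pkgs/b
```

It's not a full answer (history-heavy operations like `git blame` still see every commit), but it does make huge working trees tractable.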
I hear Facebook has some modification of Mercurial. And Google probably created something themselves in-house. But is there anything open source that supports these workflows at scale?
YAML in and of itself is not the easiest thing to parse when you have multiple layers of nesting and a lot of lines.
I don't really want to see what a CircleCI config would look like for Nixpkgs.
Once you get to the point of scaling your CI you're looking at tailored infrastructure to make sure you're only building what needs to be built.
Also proprietary systems like PlasticSCM.
(And would Git really have beaten Mercurial if GitHub had been HgHub instead? GitHub's success was more about process than the technology of Git, IMO.)
Hg is a much better user experience than git, that's for sure. Git won because of GitHub, which may have beaten any HgHub simply because Git has an actual API while Mercurial's "API" is "use subprocessing". In other words, if Mercurial had given a damn about the developer experience earlier on, it might well have won the war.
Git doesn't though. A bunch of shell scripts calling shell scripts calling a few native binaries is pretty much "use subprocessing". libgit came much later, it wasn't part of the original git.
However what git did provide was an open, stable, fairly simple and officially supported physical model with which you could easily interact directly, and protocols which either worked on that (file and "dumb http") or a relatively simple exchange protocol (the "pack protocol" https://github.com/git/git/blob/9b011b2fe5379f76c53f20b964d7...).
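That openness is easy to demonstrate: the plumbing commands expose the stored object model directly. A throwaway local repo (file names invented) shows it:

```shell
set -e
tmp=$(mktemp -d)
git init -qb main "$tmp/demo" && cd "$tmp/demo"
echo hi > file.txt
git add . && git -c user.email=x@y -c user.name=x commit -qm init
# Every object is content-addressed and inspectable with plumbing commands:
git cat-file -p HEAD            # the commit object, exactly as stored
git cat-file -p 'HEAD^{tree}'   # the tree it points to
```

Third-party tools could build on that stable physical model without going through any blessed library.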
Hell, if anything hg's always provided more API than git, the extension model wouldn't be possible without it e.g. stdout coloration could be an hg plugin while it had to be implemented in each git command.
Bitbucket for Mercurial launched about the exact same time as GitHub.
I'm not following the call for something new, though:
> A source control that treats CI, CD, and releases as first-class citizens, rather than relying on the very useful add-ons provided by GitHub and its community.
I'm not a die-hard every-tool-should-do-exactly-one-thing-the-way-the-unix-gods-intended type of person, but in this case, I really feel that source control should stick to being source control. Hooks and add-ons are great precisely because things like CI and CD came after, and who knows what the new rage will be 5 or 10 years from now.
Building everything for today's workflow into a single tool means that by the time it's ready, the "today's" workflow won't be cool anymore, and we'll have other newer tools and processes that this new source control can't support :/
In my experience on teams at growing companies, I've seen pain points around continuous integration, configuration management, integration testing, dev/prod parity, feature flagging and releasing, and provisioning staging servers in terms of pure tech/infra issues. Beyond that, I've seen more pressing general organizational issues around tech debt, software design collaboration, architectural debt, and code review processes -- these are all pressing and valid concerns. But I just find the conclusions of this blog post flat out wrong. To conflate an unsatisfactory CI choice and configuration (which is totally reasonable) with a failure of version control is a pretty serious error. It doesn't fully disprove the thesis, but it certainly doesn't lend it support.
If you've installed a wheel onto a poorly set up suspension and get handling issues, does it mean you should reinvent that wheel, or does it mean you should check if your suspension may need some tuning?
I would LOVE something between subtrees and submodules.
I have explored this many times, and if I had the ability to write something like this, I would.
I would love it if I could have a child repo that did not require an external remote and could be bundled and stored within a parent repo, unlike a submodule. But I would also like it if it could be more decoupled from the history unlike a subtree.
I can get most of what I want from submodules and subtrees, but not really enough.
It might be possible without even having to change git. Perhaps if there were a way to have branch namespaces of some kind, and I could have a subtree have completely separate history, but have it checked out within the same working tree. Many of my projects that are submodules only make sense within their parent repo, and it is really redundant to have an external repo for them. But I also don't like to have to do expensive surgery to deal with subtrees, and it would be nice to not have it be completely merged.
My dream is to be able to drop a repo inside another repo and have git just treat it as if it were part of the parent repo. And then to be able to bundle the child repo to the parent and push it.
I know that it is mostly possible to do this already, but it is not easy or intuitive.
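One hedged approximation with stock git is `git bundle`: the child's entire history becomes a single file that the parent tracks and pushes like any other blob. A sketch, not the seamless integration described above (paths and names invented):

```shell
set -e
tmp=$(mktemp -d)
# A child repo whose full history we want to carry inside a parent repo.
git init -qb main "$tmp/child" && cd "$tmp/child"
echo lib > lib.txt
git add . && git -c user.email=x@y -c user.name=x commit -qm child
git bundle create child.bundle main    # entire branch history in one file
# The parent tracks the bundle like any other file.
git init -qb main "$tmp/parent"
cp child.bundle "$tmp/parent/" && cd "$tmp/parent"
git add child.bundle && git -c user.email=x@y -c user.name=x commit -qm "vendor child"
# Anyone who clones the parent can reconstruct the child:
git clone -q -b main child.bundle child-checkout
```

The obvious downside is that the bundle is opaque to the parent's history tools, which is exactly the decoupling-versus-integration tension being described.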
I'm not sure if I understand you right, but I think I made what you describe: https://github.com/feluxe/gitsub
It's a simple wrapper around git, that allows nested git repositories, with almost no overhead.
I use it for a private library (the parent-repo), which itself contains modules (the child-repos) that I open sourced on github. It works fine for my use case. I wrote it, because I found "submodule" and "subtree" too complicated. 'gitsub' is still in alpha.
I'm just very attracted to the idea of bundling repos together. I frequently use git-annex and datalad, and try to keep binaries and helper scripts in different repositories.
One could even use the existing submodule feature to reference unrelated history in the same repository, but the submodule tooling would want to create a separate .git folder, duplicate everything, and expect a URL to identify the repo, rather than knowing to use the parent repo.
It ought to be possible to modify git submodule so you can specify that the submodule is actually just all the refs in namespace "blah" of the physically containing repo. Basically you would only need an expanded version of the .git "symlink file" feature that lets you specify both a "gitdir" and a ref namespace to use for all operations. Then, poof, you would have self-contained submodules.
You would still have the problem that namespaced refs do not get cloned by default.
You also risk pushing refs of the parent whose module entries reference objects that exist only in unpushed refs of a ref namespace, meaning that if somebody cloned the repo and tried to expand the self-contained submodules, they would discover that the commits are not present. I'd not be surprised if regular submodules had that limitation too. (I've never really used them in a manner where I might modify and commit in the submodule.)
While I do present a proof-of-concept implementation using hooks, a proper implementation would require some changes to the git client, I imagine.
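The ref-namespace half of this can already be exercised with plumbing today; a rough local sketch of storing a child's refs inside the parent's object database (see gitnamespaces(7); repo names invented):

```shell
set -e
tmp=$(mktemp -d)
git init -qb main "$tmp/child" && cd "$tmp/child"
echo lib > lib.txt
git add . && git -c user.email=x@y -c user.name=x commit -qm child
# Store the child's entire history under a ref namespace of the parent:
git init -q --bare "$tmp/parent"
git --git-dir="$tmp/parent" fetch -q "$tmp/child" \
    'refs/heads/*:refs/namespaces/child/refs/heads/*'
# Serving the parent with GIT_NAMESPACE=child would expose just these refs
# as if they were an ordinary repository:
git --git-dir="$tmp/parent" for-each-ref refs/namespaces/child
```

What's missing is exactly the client-side piece described above: a worktree checkout that points at those namespaced refs instead of a separate .git directory.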
It makes it so hard to read the remainder untainted by a certain amount of scepticism.
Fortunately, he's not actually saying anything much in the article so I don't think my irrational reaction to ignorance will mean I've missed something important. But still...
Many devs barely scratch the surface of what git can do anyway. Onboarding them on a few extra scripts seems better than an entirely new scm tool.
Git is great. It also has issues. Scaling has issues. Changing a tool won't solve scaling issues.
Anecdotally I think submodules work just fine, although the git submodule tool is not intuitive. Then again, I work in a very small team on small projects compared to these mammoths being discussed with monorepos and the like.
Opinions should ideally not be swayed by the volume of text making arguments, but by the coherence and logic of those arguments.
We are transitioning to multi-repos because we have been burned so hard by our mono-repo. Builds used to take 3 hours on the monster and we managed to get them down to 30 minutes, but we are truly at the bottom of the barrel. God help us if there is a build failure, every subsequent build fails while we scramble to identify the problem (and we can only sample success or failure every 30 minutes). It's a house of cards and it's horrible.
> Shots fired: multi-repos suck
We've already had debugging woes with this combined with internal package feeds (you have to pull down the code, build it, remove the package and replace it with the local code), which has made us very bearish on code re-use. That rigmarole sucks way less than mono-repos.
> You can’t have your cake and Git it too
Combine version control and package managers IMO. Go does one half of this. If you work under GOPATH with all your code, you can easily jump across repos to make changes and have those changes immediately propagate to the initial repo. Your hard-disk becomes the mono-repo. What Go doesn't do is pull binaries down from package feeds. There needs to be some simple mechanism to switch between builds and code.
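For what it's worth, Go modules later grew a switch of roughly this kind via the `replace` directive, which points a published dependency at a local checkout while developing. A hypothetical go.mod (module paths invented):

```
// go.mod (hypothetical): edits in ../lib propagate to this build immediately,
// without publishing a new version of example.com/lib.
module example.com/app

require example.com/lib v1.2.0

replace example.com/lib => ../lib
```

Dropping the `replace` line switches back to the released artifact, which is the "switch between builds and code" mechanism being asked for, albeit per-module rather than system-wide.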
I would also suggest that mono-repos work better with statically typed languages with module boundaries and visibility control. The problem of anything being able to touch anything else is not so bad when you can hide implementation details behind small APIs.
I have definitely felt some pain with having Ruby projects in a single repo using git, but much less so with Java projects using Hg.
[1]: https://bazel.build/
I take it the author has never had an interest in politics.
Ultimately the problem is in scaling the number of build inputs, not the number of .git directories.
It's tempting to imagine an integrated system where making changes to a piece of source code automatically commits every change, every commit will attempt to compile and build, every successful build auto-packages into a new artifact with a new build version. The language and the build system would ensure that all builds are reproducible. Because of this, all builds can be addressed by identity (content hash) too, not just a name and a build number within some namespace.
When any dependency of the current project has newer builds, one could choose to pull up an interactive diff experience to step through the code of newer versions. This would aid in selecting a different version on which to depend, if desired. If a different version of a dependency is picked up, a new build gets triggered too, and a successful build gets a new build version.
The strong linkage between source code revision and build version, the deterministic builds, and content-based artifact addressing work together to ease the traceability of changes and the reusability of artifacts, and sidesteps concerns about the hosting and namespacing of source code and build artifacts interfering with the project's "single source of truth", because any copy of an artifact, known by any name, irrelevant of its location, will share the same hash.
There will still be usability problems with such a system too. There would be no way to strip data out. A shelve, replay, and cherry-pick frontend would be necessary to allow the doctoring of input before it's committed permanently -- but in such a system, only permanently committed code can be built. The workflow to prepare a project for public consumption would be to author and test all the changes in a 'scratch' project that doesn't auto-disseminate its build artifacts elsewhere, and cherry-pick the changes into a public project. Public projects could only have public dependencies.
Configuration files, data files, and pieces making up a larger environment may need a different approach. Nonetheless, a lot of these problems take the same shape: some input should deterministically produce some output, and a running system may choose to alter its own state by interfacing with a stateful outside world (e.g. load or write files, communicate through a network). The sensible places of drawing a boundary between the inside world and outside world will differ for every use-case.
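The content-addressing piece of this sketch is easy to model in miniature; a toy shell version, where `git hash-object` is used only as a convenient content hash and the "build" is a stand-in copy:

```shell
set -e
# Toy model of content-addressed artifacts: the artifact's identity is the
# hash of its build input, so rebuilding identical input is a no-op, and any
# copy of the artifact carries the same name everywhere.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p src store
echo 'echo hello' > src/main.sh
key=$(git hash-object src/main.sh)
if [ ! -e "store/$key" ]; then
    cp src/main.sh "store/$key"   # stand-in for a deterministic build step
fi
echo "artifact: store/$key"
```

A real system would hash the full closure of inputs (sources, dependencies, toolchain), which is essentially what Nix-style stores do.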
But allow me to respond to the bald assertions presented in the article:
Monorepos are great.
Multirepos are great.
Git is the best source control system ever. And if you think it could do something better, well, have I got news for you: it's completely open source and extensible, with various script entry points and an easily accessible API.
Thanks for reading my blog.
To be clear, I'm not disagreeing. But it is simply not good enough. Any new generation of source control needs to be able to do things that are difficult with Git, and Git simply isn't extensible enough. Microsoft has a Git VFS, and there's Git LFS, but this just doesn't go far enough.
There are good technical reasons why you would use Perforce or even Subversion these days.
The people who made Git made it for working on large, but not huge, open-source code repositories with a traditional model. It doesn't work so well for vendoring, it doesn't work well for artists, it doesn't have locking, it doesn't have access controls (and there's only so much you can add). You can argue that these features don't make sense or we're using Git "wrong" or I can write a bunch of hooks but at some point I just want them to work and I'm tired of fighting with Git to make it happen.
Just personal background, these days I work with closed source and open source, monorepos and multirepos, Git, Subversion, and Perforce all on a regular basis (and sometimes use weird custom setups). Git is by far the most familiar of the three, and I've published some tools for Git repo surgery.
Can you say more? What are some of those reasons? Or link to some data or examples?
So there are a lot of drawbacks to using gitolite, but we were able to customise access controls down to allowing some users to change certain lines of checked-in config, and only to certain values.
Suppose, for example, we have 2 distinct projects - a backend and a frontend - each with its own testing and deployment strategy. GitLab CI only allows one CI pipeline config per repository. While we could take care of that with scripting, that can easily get out of hand as we increase the number of distinct "projects" if we wanted to maintain a monorepo. So the tooling encourages us to have separate repos.
However if we do that, since we don't have that convenient single commit hash that a monorepo gives us, then we don't have a good way to ensure that the deployments between projects are synced up, and rollbacks are far more complicated.
It's a contrived example (for instance, we could switch to a different CI system and mitigate this issue), but it seems to me that whatever an organization chooses, mono- or poly-repo, they have to build complicated custom configurations and tooling to get over whatever tradeoffs their decision has. And as the number of logical projects (repos, submodules, etc.) and the commit rate increase, the tooling has to grow in complexity to handle issues of scale.
So I guess the open question is, is there a way we can somehow have both without spending a bunch of engineering cycles writing custom configs and tools?
Thank you for this feedback - it’s something we’re thinking about a lot too. We’ve made some improvements for monorepos (`changes:` keyword) and for micro-services/multi-repo (`trigger:` and `dependency:` keywords) but we’re not satisfied!
We have two open Epics - one for making CI lovable for monorepos (https://gitlab.com/groups/gitlab-org/-/epics/812) and one for making CI lovable for microservices (https://gitlab.com/groups/gitlab-org/-/epics/813). Would love community feedback on the direction those will take us and how we can up level lovability even more.
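For context, the `changes:` keyword mentioned above scopes a job to the paths touched by a commit, which addresses part of the one-pipeline-per-repo complaint upthread. A hypothetical .gitlab-ci.yml fragment (job names and scripts invented):

```yaml
# Per-project jobs in one monorepo; each job only runs when its subtree changes.
backend-test:
  script: ./backend/run-tests.sh
  only:
    changes:
      - backend/**/*

frontend-test:
  script: ./frontend/run-tests.sh
  only:
    changes:
      - frontend/**/*
```

It doesn't solve the synced-deployment or rollback half of the problem, but it does keep a monorepo's pipeline from rebuilding everything on every commit.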
Sounds like you have a deploy & release issue, not a development or publishing one. Octopus Deploy was the first system I saw that made a distinction between them, and it eliminated a swathe of issues by simply saying "a release is a set of versioned packages".
wow octopus deploy got expensive