I've been working on an ephemeral/preview environment operator for Kubernetes (https://github.com/pier-oliviert/sequencer), and I find myself agreeing with a lot of what OP said.
I think dev boxes are really the way to go, especially with all the components that make up an application nowadays. But the latency/synchronization issue is a hard topic, and it's full of tradeoffs.
A developer's laptop always ends up being a bespoke environment (yes, Nix/Docker can help with that), and so, there's always a confidence boost when you get your changes up on a standalone environment. It gives you the proof that "hey things are working like I expected them to".
1. New shared builds update container images for applications that comprise the environment
2. Rather than a "devbox", devs use something like Docker Compose to utilize the images locally. Presumably this would be configured identically to the proposed devbox, except with something like a volume pointing to local code.
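A minimal sketch of what that might look like, assuming a hypothetical `api` image published by the shared builds (registry, service names, and paths here are illustrative, not from the original comment):

```yaml
services:
  api:
    image: registry.example.com/api:latest   # image produced by shared CI builds
    volumes:
      - ./api/src:/app/src                   # local code overrides the baked-in copy
    ports:
      - "8080:8080"
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only            # throwaway credentials for local use
```

The idea being that everything except the service you're editing runs straight from the shared images, and only your service sees local code.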
I'm interested in learning more about this. It seems like a way to get things done locally without involving too many cloud services. Is this how most people do it?
Considering the cost of a developer's time, and that you can do shenanigans to drive the server cost even lower, this all feels totally reasonable.
It's darkly amusing how we have all these black-magic LLM coding assistants but we can't be reasonably assured of even 2000s level type-aware autocomplete.
What? Which languages are you talking about? For Python, VSCode is leaps and bounds ahead of PyCharm if your project is well typed.
JetBrains offer a remote solution now though: https://www.jetbrains.com/remote-development/gateway/
I don't recall latency being a big problem in practice. In an organization like this, it's best to keep branches up to date with respect to master anyway, so the diffs from switching between branches should be small. There was a lot of work done to make all this quite performant and nice to use. The slowest part was always CI.
To me the root issue is that the complexity of production environments has expanded to the point of impacting the complexity of developer environments just to deploy or test. This comes in conjunction with the expanding complexity of developer environments just to develop, e.g. webpack.
For very large well resourced organizations like Stripe that actually operate at scale that complexity may very well be unavoidable. But most organizations are not Stripe. They should consider decreasing complexity instead of investing in complex tooling to wrangle it.
I'd go as far as to suggest both monorepos and dev-boxes are complex toolchains that many organizations should consider avoiding.
It became clear to me that cloud-only is not the way to go, but instead a local-first, cloud-optional approach.
https://mootoday.com/blog/dev-environments-in-the-cloud-are-...
I should be able to launch a local VM using the GitHub Desktop App just as easily as I can an Azure-hosted instance.
By running a Linux VM on your local machine you get a consistent environment that you can ssh to; you remove the latency issues, and you also remove all the complexity of syncing that they've created.
That’s a setup that’s worked well for me for 15 years but maybe I’m missing some other benefit?
* Local dev has laptop-based state that is hard to keep in sync for everyone. Broken laptops are _really hard_ to debug as opposed to cloud servers I can deploy dev management software to. I can safely say the oldest version of software that's in my cloud; the laptops skew across literally years of versions of dev tools despite a talented corpeng team managing them.
* Our cloud servers have a lot more horsepower than a laptop, which is important if a dev's current task involves multiple services.
* With a server, I can get detailed telemetry out of how devs work and what they actually wait on that help me understand what to work on next; I have to have pretty invasive spyware on laptops to do the same.
* Servers in our QA environment can interact with QA services in a way that is hard for a laptop to do. Some of these are "real services", others are incredibly important to dev itself, such as bazel caches.
There's other things; this is an abbreviated list.
If a Linux VM works for you, keep working! But we have not been able to scale a thousands-of-devs experience on laptops.
I’m sure there are a bunch of things that make it the right choice for Stripe. Obviously if you just have too many things to run at a time and a dev laptop can’t handle it then it’s a dealbreaker. What’s the size of the cloud instances you have to run on?
So if this was a problem back then, when the company had fewer than 1000 employees, I can't even imagine how hard it would be to get local dev working now.
Not saying it's the wrong choice for you, but it's a choice, not a natural conclusion.
The amount of time companies lose to broken development environments is incredible. A developer can easily lose half a day (or more) of productive time.
With cloud environments it’s much easier to offer a “just give me a brand new environment that works” button somewhere. That’s incredibly valuable.
I don’t doubt that Stripe has a setup that works well for them, but I also bet they could have gone down a different path that also worked well, and I suspect that other path (local VMs) is a better fit for most smaller teams.
It also centralizes dev environment management to the platform team that owns them and provides them as a service which cuts down on support tickets related to broken dev environments. There are certainly some trade offs though and for most companies a local VM or docker compose file will be a better choice.
And the dev environment stops running when you close the laptop, but you also don't need it since you're not developing.
Not saying it can work for absolutely all cases but it's definitely good enough for a lot of cases.
Never in my life did I want to scale my dev environment vertically or horizontally or in any other direction. Unless you work on a calculator, I don't know why you would need that.
I have no problems with my environment stopping when I close my laptop. Why is this a problem for anyone?
The overwhelming majority of programming projects out there fit on a programmer's laptop just fine. The rare exceptions are projects which require very specialized equipment not available to the developers. In any case, a simulator would usually be a preferable way of dealing with this, and the actual equipment would only be accessed for testing, not for development. Definitely not as part of the routine development process.
Never in my life did I want the development process to be centralized. All developers have different habits, tastes and preferences. The last thing I want is centralized management of all environments, which would create unwanted uniformity. I've only once been in a company that tried to institute a centrally-managed development environment in the way you describe, and I just couldn't cope with it. I quit after a few months of misery. The most upsetting aspect of these efforts is the stupidity. They solve no problems, but add a lot of pain that is felt continuously, all the time, whenever you have to do anything work-related.
However, in some situations you must endure the pain of doing this, for example, for regulatory reasons. Some organizations will not allow you to access their data anywhere but on some cloud VM over which they give you very botched and very limited control. While these restrictions are usually technically easy to side-step, you are legally required not to move the data outside the boundaries defined for you by IT. And so you are stuck in this miserable situation, trying to engineer some semblance of a decent utility set in a hostile environment.
Another example is when the infrastructure of your project is too vast to be meaningfully reduced to your laptop, and a lot of your work is exploratory in nature. I.e. instead of typical write-compile-upload-test you are mostly modifying stuff on the system you are working on to see how it responds. This is kind of how my day-to-day goes: someone reported they fail to install or use one of the utilities we provide in a particular AWS region with some specific network settings etc. They'd give me a tunnel to the affected cluster, and I'd have some hours to spend there investigating the problem and looking for possible immediate and long-term solutions. So, you are essentially working in a tech-support role, but you also have to write code, debug it, sometimes compile it etc.
The idea here is that you use a VM (cloud or local) to run your compute. Most people can run it in the background without explicitly connecting to it.
Or just run Linux on your local machine as the OS. I don't get the obsession with Macs as dev workstations for companies whose products run on Linux.
Or Guix, which has the advantage of a more pleasant language.
As we've started to use it more extensively, we've also found that we want to add some enhancements, work out some bugs, and experiment with our own customizations out-of-tree, etc. I'm happy to report here on HN that devenv is well-documented and easy to extend for Nix users who have some experience with Nix module systems, and that Domen is really responsive to PRs. :)
And for the large majority of companies/projects: if your project is so complex and heavy on resources that it doesn't fit on a modern laptop, the problem is not the laptop, it's the whole project and the culture and cargo-cult around "modern" software development.
Are there any more recently ex-Stripe folks here willing and able to comment on how Stripe's developer environment might have evolved since the OP left in 2019?
The biggest difference not mentioned is the article is that code is no longer kept on developer machines. The sync process described in the article was well-designed, but also was a fairly constant source of headaches. (For example, sometimes the file watcher would miss an update and the code on your remote machine would be broken in strange ways, and you'd have to recognize that it was a sync issue instead of an actual problem with your code.) As a result, the old devbox system was superseded by "remote devboxes", which also host the code. Engineers use VSCode remote development via SSH. It works shockingly well for a codebase the size of Stripe's.
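For anyone who hasn't used it: VS Code's Remote-SSH workflow just needs a resolvable host entry. A minimal sketch (the hostname and username are made up for illustration):

```
# ~/.ssh/config
Host devbox
    HostName devbox.internal.example.com
    User me
    ForwardAgent yes
```

Then `code --remote ssh-remote+devbox /home/me/repo` (or picking the host from the Remote-SSH menu) opens the editor locally while files, terminals, and language servers all run on the dev box, which is what removes the sync problem entirely.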
There are actually several different monorepos at Stripe, which is a constant source of frustration. There have been lots of efforts to try to unify the codebase into a single git repo, but it was difficult for a lot of reasons, not the least of which was the "main" monorepo was already testing the limits of the solution used for git hosting.
Overall, maintaining good developer productivity is an extremely challenging problem. This is especially true for a company like Stripe, which is both too large to operate as a "small" company and too small to operate as a "big" company. Even with a well-funded team of lots of super talented people putting forth their best efforts, it's tough to keep all of the wheels fully greased.
Especially given VSCode, or Cursor ;), work so well via ssh.
To the engineers that don't want to use those IDE's it might suck temporarily, but that's it.
* Code is off of laptops and lives entirely on the dev server in many (but not all) cases. This has opened up a lot of use cases where devs can have multiple branches in flight at once.
* Big investments into bazel.
* Heavier investment into editor experiences. We find most developers are not as idiosyncratic in their editor choices as is commonly believed, and most want a pre-configured setup where jump-to-def and such all "just work".
I don't think it has to do with the dev environment itself, but I'd blame such a thing for allowing teams to deliver "too fast" without thinking twice. Combine that with new blood in management and that's an accident waiting to happen. *
They're the best in business still, but far from the well-designed easy-to-use API-first developer-friendly initial offering.
* Pure speculation based on very evident patterns
Though I am under the impression that things have gotten more sensical internally over the last year or so.
Note also that the devprod team has largely been shielded from the craziness, and may still be making good decisions (but I don't know what they are in this realm personally).
The worst problem is refining the ignore settings: you have to ensure only source code is synced (to prevent conflicts on derived files), while making sure no ignore rule accidentally matches the names of actual code files.
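A sketch of that second failure mode, using simple glob matching (the patterns and filenames are made up; real sync tools like watchman have their own ignore syntax):

```python
import fnmatch

# Patterns meant to exclude logs, build output, and temp files from syncing.
IGNORE_PATTERNS = ["*.log", "build/*", "tmp*"]

def should_sync(path: str) -> bool:
    """Return True if the file should be synced to the dev box."""
    return not any(fnmatch.fnmatch(path, pat) for pat in IGNORE_PATTERNS)

print(should_sync("src/app.py"))        # True: regular source file
print(should_sync("build/app.js"))      # False: derived artifact, correctly skipped
print(should_sync("tmpl_renderer.py"))  # False: real code silently skipped by "tmp*"
```

The last case is exactly the overlap problem: an ignore rule written for temp files quietly swallows a source file, and the dev box ends up subtly out of date.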
We ran a similar project back then that I coined "Developer On-Demand" to tackle that same problem space. It's also what eventually led me to discover the magic of Nix and then build Flox.
I also agree with a lot of what was shared in other comments: while we tackled these problems at large orgs such as Facebook, Shopify, Uber, and Google (to name a few teams I remember working with), and obviously also Stripe, certain areas of the pain are 100% universal regardless of team size.
On the Flox side, we're trying to help with a few of them today and hopefully many more soon; very open to thoughts! Things like simple-to-use Nix for each of your projects, keeping deps and config up to date across everyone's MacBooks and Linux boxes, etc., even if you don't have a full AWS team and a language-server team ready to support you.
I grew up with mainframes, minis, and Unix batch and/or multiuser machines; for me this is the best way for business applications. I didn't particularly like the move to local all that much.
Among other things, Brisk allows you to run tests for your local code changes in the cloud (basically the `pay` mini-test piece, but for any test runner).
We also have a sync step much like the one described here and allow users to run one off commands (linters, tsc etc)
how does this work for interactive debugging?
I was going to ask the same about the system in TFA but I might as well ask you :)
That also avoids hacky sync scripts.
They don’t work from your local development env and only work in your CI env.
Mostly, Brisk was designed to run your complete test suite on every code save (i.e. a local save), but it also works great from your CI.
We can run entire test suites in seconds, which is performance you don’t get with the systems you named (which are generally for building/compiling).
> Finally: the development experience, of course, is only part of the story: the full lifecycle of code and features continues onward into CI and code review and ultimately through deployment into production, where it will be further observed, debugged, and evolved. Writing about those systems would require further posts at least this long.
In case the author is around: I would love to read those!
What Stripe’s configuration introduced is that they used a remote LS instead of the default local LS. Regardless, VS Code already defers LSP communication until it feels idle, and developers are used to that. So I wouldn’t expect a remote LS to significantly impact the level of inconsistency that developers already accept when using a local LS.
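For context, each of those deferred editor interactions is a small JSON-RPC message, so over a high-latency link it's the round trip, not the payload, that hurts. A go-to-definition request in the LSP looks roughly like this (the file path and position are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "textDocument/definition",
  "params": {
    "textDocument": { "uri": "file:///repo/src/charge.rb" },
    "position": { "line": 120, "character": 8 }
  }
}
```

Whether the server answering that request is local or remote is invisible to the editor; only the latency of the reply changes.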
The choice to run the dev environment far away from the files puts you in the position of needing to engineer your way past the inconsistency.
It regularly comes up that we should make a devprod setup for frontend folks that abstracts the backend away more.
Overall a lot of people prefer local dev because it gives them access to the entire stack, lets them run branch images easier, and has better performance than remote boxes.
https://moov.io/blog/education/moovs-approach-to-setup-and-t...
Bigger than Shopify's?
Also on a headcount level, Google tells me Shopify has 3,500 employees to Stripe's 9,500. Obviously neither company is comprised entirely of engineers, so this is a ballpark estimate.
GitHub feels like the real case where there might be a larger codebase. It's in the middle for employees (6,500), but it's existed longer than Stripe (though not as much longer as my gut feeling told me, interestingly)
I thought they used active_merchant
> currently amounting to over 15 million lines of code spread across 150,000 files
The monorepo has only gotten bigger over the last two years (source: I work at Stripe).
In general it was pretty rare, in my experience. The code bases were pretty well modularized.
This is so important when deciding to re-invent the wheel. I've gotten bitten by this many times.
The author mentions the codebase was Ruby, but I didn't see if they talked about Rails.
Or if the framework is barely noticeable at that scale and doesn't really matter anymore. That's the impression I get for Instagram (which was built with Django).
From post, the problems that justified central dev boxes are roughly: 1. dependency / config mgmt / env drift on laptops 2. collaboration / debugging between engineers 3. compute scaling + optimization 4. supporting devs with updates and infra changes
The last one is particularly interesting to me, because supporting the dev env is separate engineering role/task that starts small and grows into teams of engineers supporting the environment.
I'm helping build Flox. We're working on these pain points by making environments (deps, vars, services, and builds) workable across all kinds of Mac/Linux laptops and servers:
1. a) Virtualize the package manager per-project. b) Nix packages can install across OS/arch pretty well.
2. Imperative actions like `flox install`/`upgrade` always edit a declarative env manifest.toml -- share it via git.
3. Fewer Docker VMs -- get more out of the dev team's MacBooks.
4. Reduce toil with versioned, shareable envs --> less sending ad-hoc config and brew commands to people (as mentioned in the post). Just `git pull && flox activate`.
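As a rough sketch of the kind of declarative manifest this implies (the exact Flox schema may differ; the package names, fields, and values here are illustrative only):

```toml
version = 1

[install]
nodejs.pkg-path = "nodejs"          # pinned via the env's lockfile
postgresql.pkg-path = "postgresql"

[vars]
DATABASE_URL = "postgres://localhost:5432/dev"
```

Because the manifest lives in the repo, "share the environment" reduces to sharing the checkout.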
I think on problem point #2, collab tools are advancing to the point where pairing on features, bugs, and env issues can be done without central SSH (e.g. tmate, VS Code Live Share, screen sharing). However, that does sort of fall apart on laptops for async debugging of env issues (e.g. when devprod is in the US and eng is in London). Having universal telemetry on ephemeral cloud dev boxes, with a registry and all the other DNS and SSH goodies, could be the kind of infra to aspire to as your small team runs into more big-team problems.
In the Stripe anecdote, adopting the centralized infra created new challenges that their devprod teams were dedicated to supporting:
- international latency from central, US-based VMs
- syncing code to the dev boxes (https://facebook.github.io/watchman/)
- linting, formatting, generating configs (run it locally or server-side?)
- a dev workflow CLI tool dedicated to dev-box workflows and syncing with watchman's clock
- IaaS, registry, config, glue for all the servers
This is all very non-trivial work, but maybe there's a future where people can win some portability with Flox when they are small and grow into those new challenges when it's truly needed -- now their laptop environments just get a quick `flox activate` on some new, shiny servers or Cloud IDE's.
I really like the author's notes on how using the Language Server Protocol across a high-latency link has great optimizations that work alongside the watchman sync for real-time code editing.
Is basically the summary for most mono/multi repo discussions, and a bunch of other related ones.
With a monorepo, it's common to have a team focused on tooling and maintaining the monorepo. The structure of the codebase lends itself to that.
With a multirepo codebase, it's usually up to different teams to do the work associated with "multirepo issues": orchestrating releases, handling dependencies, dev environment setup, etc. So all that effort just kinda gets "tucked away" as overhead that each team assumes, and isn't quite as visible.
If there's anything I'd say to low-level execs, the kind that end up with a few hundred developers under them, it's that mis-sizing the tooling team, in one way or the other, comes with total productivity penalties that will appear invisible, but will make everything expensive. Understanding how much of a developer's day is toil is very important, but few really try to figure that out.
I think a lot of this type of thing comes about because with a monorepo you can actually see the problems to solve, whereas you can easily end up with the same N engineers firefighting the same problems K times across all your polyrepos.
I understand that "engineers" may not mean "developers"; it could be DevOps, site reliability, and all the bits and pieces that make up a large service provider. But over a thousand?
Can someone please enlighten me?
I wouldn't be surprised if many people needed a safari map, or README documentation in every single folder, to navigate a repository as large as Stripe's.
It sounds like the emergence of a new bad practice if you find yourself praising how large your code base is.
No different to having thousands of smaller repos instead.
I personally dislike monorepos for very niche, in-the-weeds operational reasons (as an infra person), but their ergonomics for DX cannot be overstated.
Or are there any other aspects to the monorepo architecture that make it beneficial for large companies like that?
Just curious, I've never worked in such an environment myself.
When several of the world’s most successful software companies use this approach, it’s hard to argue that it’s inherently bad. Of course it’s sensible to discuss what lessons apply to smaller companies who don’t have the luxury of dedicated tooling teams supporting the monorepo and dev environment.
Is...documentation a bad thing?