I figured these toys would be replaced pretty quickly, but turns out they do the job for these small businesses and need very little maintenance. Moving the app to a new server instance is dead simple because there's basically just the script and the data file to copy over, so you can do OS updates and RAM increases that way. Nobody cares about a few minutes of downtime once a year when that happens.
There are good reasons why we have containers and orchestration and stuff, but it's interesting to see how well this dumb single-process style works for apps that are genuinely simple.
I think people are also too quick to add secondary data stores and caches. If you can do everything with a transactional SQL database + app process memory instead, that is generally going to save you tons of trouble on ops, consistency, and versioning issues, and it can perform about as well with the right table design and indexes.
For example: instead of memcache/redis, set aside ~100 MB of memory in your app process for an LRU cache. When an object is requested, hit the DB with an indexed query for just the 'updatedAt' timestamp (should be a sub-10ms query). If it hasn't been modified, return the cached object from memory, otherwise fetch the full object from the DB and update the local cache. For bonus points, send an internal invalidation request to any other app instances you have running when an object gets updated. Now you have a fast, scalable, consistent, distributed cache with minimal ops complexity. It's also quite economical, since the RAM it uses is likely already over-provisioned.
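The freshness-check idea above can be sketched roughly like this (Python with SQLite for illustration; the table name, column names, and class name are all mine, not from the original comment):

```python
import sqlite3
from collections import OrderedDict

class FreshnessCheckedCache:
    """LRU cache that revalidates entries with a cheap indexed
    'updated_at' query before serving them from process memory."""

    def __init__(self, conn, max_entries=1000):
        self.conn = conn
        self.max_entries = max_entries
        self.cache = OrderedDict()  # id -> (updated_at, full_row)

    def get(self, obj_id):
        # Cheap query: only the timestamp, answerable from an index.
        row = self.conn.execute(
            "SELECT updated_at FROM objects WHERE id = ?", (obj_id,)
        ).fetchone()
        if row is None:
            self.cache.pop(obj_id, None)
            return None
        updated_at = row[0]
        cached = self.cache.get(obj_id)
        if cached is not None and cached[0] == updated_at:
            self.cache.move_to_end(obj_id)   # refresh LRU position
            return cached[1]                 # still fresh: skip the full fetch
        # Stale or missing: fetch the full object and cache it.
        full = self.conn.execute(
            "SELECT * FROM objects WHERE id = ?", (obj_id,)
        ).fetchone()
        self.cache[obj_id] = (updated_at, full)
        self.cache.move_to_end(obj_id)
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)   # evict least-recently-used
        return full

    def invalidate(self, obj_id):
        # What the "internal invalidation request" from another
        # app instance would call.
        self.cache.pop(obj_id, None)
```

An in-process dict stands in for the ~100 MB LRU; a real app would bound it by bytes rather than entry count.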
This is exactly the approach that EnvKey v2[1] is using, and it's a huge breath of fresh air compared to our previous architecture. Just MySQL, Node/TypeScript, and eventually consistent replication to S3 for failover. We also moved from EKS (AWS's Kubernetes product) to Fargate, and that's been a lot simpler to manage as well.
I've never built something with this type of mechanism for a DB query, but it's interesting. I've never timed a query like this either, but I suspect it's going to be an "it depends" situation based on what fields you're pulling back, whether you're using a covering index, just how expensive the index seek is, and how frequently the data changes. I've mainly treated it as "avoid round trips to the database" -- zero queries is better than one, and one is better than five.
I also guess it depends on how frequently the data is updated: if the timestamp has changed 100% of the time, you might as well just fetch (no caching). Based on all the other variables above, the inflection point where this makes sense is going to shift.
Interesting idea though, thanks.
> For bonus points, send an internal invalidation request to any other app instances you have running when an object gets updated. Now you have a fast, scalable, consistent, distributed cache with minimal ops complexity.
Now you have to track which other app servers exist, handle failures/timeouts/etc. in the invalidation call, and make sure your app's logic still works properly if the invalidation doesn't happen for any reason (the classic cache invalidation problem). My inclination is that at this point you're on the path to replicating a proper cache service anyway, and using Redis/Memcache/whatever would ultimately be simpler.
>For example: instead of memcache/redis, set aside ~100 MB of memory in your app process for an LRU cache.
Erlang/Elixir for the win, with (almost transparent) multi-core concurrency and ETS ;)
Apps like this tend to perform like an absolute whippet too (or if they don't, getting them to perform well is often a 5-line change). It's really freeing to be able to write scans and filters with simple loops that still return results faster than a network roundtrip to a database.
The problem is always growth, either GC jank from a massive heap, running out of RAM, or those loops eventually catching up with you. Fixing any one of these eventually involves either serialization or IO, at which point the balance is destroyed and a real database wins again.
Absolutely. The challenge is having enough faith that it will take long enough to catch up to you.
Statistically speaking, it won't catch up to you, and if it does, it will have taken so long that you should have seen it coming from miles away and had time to prepare.
In my systems that use an in-memory/append-only technique, I try to keep only the pointers and basic indexes in memory. With modern PCIe flash storage, there is no good justification for keeping big fat blobs around in memory anymore.
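A minimal sketch of the pattern the parent describes -- only pointers and a basic index in memory, blobs left on flash (Python for illustration; the class and method names are mine):

```python
import os

class AppendOnlyStore:
    """Append-only blob store: values live on disk; only a small
    key -> (offset, length) index is held in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # in-memory: key -> (offset, length)
        open(path, "ab").close()        # create the log file if missing

    def put(self, key, blob):
        with open(self.path, "ab") as f:
            offset = f.tell()           # append at end of the log
            f.write(blob)
        self.index[key] = (offset, len(blob))   # newest write wins

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, length = entry
        with open(self.path, "rb") as f:
            f.seek(offset)              # one seek on flash, no fat heap
            return f.read(length)
```

On PCIe/NVMe storage that single seek-and-read is cheap enough that keeping the blob itself in RAM buys very little.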
But a lot of small businesses are genuinely small. They may not sign up new customers that often. When they do, the impact to the service is often very predictable ("Amy at customer X uses this every other day, she's very happy, it generates 100 requests / week"). If growth picks up, there would be signs well in advance of the toy service becoming an actual problem.
An application I wrote recently for personal use is a double-entry accounting system, built after GnuCash hosed itself and gave me a headache. It's based on Go and SQLite. The entire thing is one file (go:embed rocks) and serves a simple HTTP interface with a few JS functions like it's 2002 again. The back end is a proper relational model stored in one .db file. It is fully transactional with integrity checks. To run it you just start the program and open a browser. To back up you just copy the .db file. You can run reports straight out of SQLite in a terminal if you want.
This whole concept could scale to tens of users fine for LOB applications and consume little memory or resources.
I strongly suspect this approach scales to tens of thousands of users. Maybe 30-40k users would be my guess on a garden variety intel i5 desktop from the past 3 years or so.
I say this because that hardware (assuming NVMe storage) will do north of 100k connect + select per second (connect is super cheap in sqlite, you're just opening a local file), assuming 2-3 selects per page serve gets me to the 30-40k number. The http server side won't be the bottleneck unless there's some seriously intensive logic being run.
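The connect+select claim is easy to sanity-check yourself. A rough benchmark sketch (Python's stdlib sqlite3 for illustration; absolute numbers depend entirely on your hardware and driver, so treat the output as indicative only):

```python
import sqlite3, tempfile, os, time

# Set up a small table keyed by an integer primary key.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(10_000)])
conn.commit()
conn.close()

N = 5_000
start = time.perf_counter()
for i in range(N):
    c = sqlite3.connect(path)            # "connect" just opens a local file
    c.execute("SELECT name FROM users WHERE id = ?", (i % 10_000,)).fetchone()
    c.close()
elapsed = time.perf_counter() - start
print(f"{N / elapsed:,.0f} connect+select per second")
```

Dividing the measured rate by 2-3 selects per page serve gives you a pages-per-second ceiling for your own machine.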
LONG ago I was amused by a Sun box in a closet that nobody knew anything about. I heard about the serial label printer that stopped working eight months ago, which was eight months after I shut off the Sun. I brought it back up again late one Friday, and the old/broken label printer magically worked again.
Now my stuff is that.
Is there logic in your app to potentially throw the last line away (incl. truncating it from the file) if it's invalid due to being the result of a non-atomic write? If so, seems a bit Not-Invented-Here compared to just using a (runtime-embedded) library that does that for you :)
From the SQLite about page. It's one of the most bulletproof and hardened pieces of software out there, I think, and it was made basically for exactly the OP's use case of a file on disk. But who knows what their usage looks like -- maybe writes to the DB are few and far between, so it's a fairly moot point.
These apps are on Digital Ocean, and I don’t remember ever having unplanned downtime with them. They do sometimes migrate instances with advance notice, but that’s a clean shutdown.
I’m sure SQLite is a better choice for almost any app. My reasons for not using it were to try to avoid dependencies out of curiosity, and that I honestly don’t like writing SQL — it just feels boring and error-prone. (Like eating celery: I know objectively it’s good for me.)
It's a real SQL DB with real records and transactions, and it is one of the most trusted and reliable pieces of software ever made. Like, check out the change log to get a sense of how they do stuff.
Well done on building an easy-to-maintain single-node app with few dependencies. You would be the SWE I'd send prayers of thanks to after onboarding (and for not making me crawl through a massive Helm chart/CloudFormation template hell).
I've tried doing something in Python (my initial programming love) over the years, but every time I'm forced to read up on the state of the ecosystem (choosing versions and package managers and whatnot), it drives me insane. I just install Node+Express and can get to work immediately (and finish quickly).
I would sit down at an interview and try to create these "proper" system designs with boxes and arrows and failovers and caches and well tuned databases. But in the back of my mind I kept thinking, "didn't Facebook scale to a billion users with PHP, MySQL, and Memcache?"
It reminds me of "Command-line Tools can be 235x Faster than your Hadoop Cluster" at https://adamdrake.com/command-line-tools-can-be-235x-faster-... , and the occasional post by https://rachelbythebay.com/w/ where she builds a box that's just fast and with very basic tooling (and a lot of know-how).
For small batches you do an interval at a convenient time, such as a time of day where the hardware is undersubscribed for some other task. As the batches grow then you have an online system that continues to work until you get rather far down the list of consequences in queuing theory. Once you get 24 hours and 1 minute of tasks per day you never catch up (it never ceases to amaze me how often I can find someone who will fight me on this point), and you must be aware that substantially before that breaking point, you can experience rather long average queuing delays.
But if your workload is spiky, you can smear 250 minutes of peak traffic out over 6-18 hours with no problems at all. You need a safe place to stash the queue and a little sophistication around recovering from failures/upgrades. Those aren't necessarily Simple tools, but if that's the most complex part of your system you're doing pretty okay.
- Cassandra (based on Dynamo/Bigtable)
- wrote a custom KV store, RocksDB, that is open source and now a company
- wrote a custom photo storage system that replaced an NFS-based design
- wrote another custom binary object store
- wrote a custom geo-distributed graph DB (TAO)
- wrote an in-house distributed FS replacement for HDFS
https://www.cs.cornell.edu/projects/ladis2009/papers/lakshma...
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
https://www.usenix.org/system/files/conference/osdi14/osdi14...
https://www.usenix.org/system/files/conference/atc13/atc13-b...
https://www.cs.princeton.edu/~wlloyd/papers/tectonic-fast21....
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A...
What am I missing?
The key is to add the minimum amount of "stuff" to a simple design to convincingly scale for some new hypothetical need.
Additionally, understanding how tolerant 99% of businesses are to the real-world problems that could hypothetically arise helps one not agonize over insane edge cases. I suspect a non-zero number of us have spent time thinking about how to provide deterministic guarantees of uptime that even unstoppable cosmic radiation or regional nuclear war couldn't interrupt.
I genuinely hope that the recent reliability issues with cloud and SaaS providers have really driven home the point that a little bit of downtime is almost never a fatal issue for a business.
"Failover requires manual intervention" is a feature, not a caveat.
We've lost our way amid the marketing the cloud providers have built up to help us solve problems we will never encounter, unless we are building the next Netflix or Facebook.
https://www.techempower.com/benchmarks
If you just need plaintext services, something like ~7 million requests per second is feasible at the moment.
By being clever with threading primitives, you can preserve that HTTP framework performance down through your business logic and persistence layers too.
... reading blogs and such where some loud mouth is telling them about so called "best practices" and so they bring that back to work with them.
There are not enough loud mouths telling people to keep it simple (until you can't or know better).
As far as complexity... if you get big enough, you can't avoid it. My meta-rule is to only accept additional complexity if solving the issue some other way is impractical.
It is almost always far, far easier to add additional moving parts to your production environment than it is to remove them after they're in use.
1. CEOs/whoever that don't listen to how much additional complexity it is to build a system with extremely high uptime and demand it anyway.
2. Developers with past experience that systems going down means they get called in the middle of the night.
3. Industry expectations. Even if you're a small finance company where all your clients are 9-5 and you could go down for hours without any adverse impacts, regulators will still want to see your triple redundant, automated monitoring, high uptime, geographically distributed, tested fault tolerant systems. Clients will want to see it. Investors will check for it when they do due diligence.
Look at how developers build things for their own personal projects and you'll see that quite often they're just held together with duct tape running on a single DO instance. The difference is, if something goes wrong, nobody is going to be breathing down their neck about it and nobody is getting fired.
If the extra complexity is microservices and containers, you might have an issue, but microservices are kind of a UNIX-philosophy derivative. I'm not sure the complexity is really intentionally added (like when someone uses an SPA framework or something); it just kind of shows up by itself when you pile on thousands of separate simple things without realizing the big picture is a nightmare.
I was scarred by the DDoS of Linode on Christmas Day 2015 (as a Linode customer at the time). I believe that was the only time my Christmas was ever interrupted by work. Of course, one might respond that being the one perpetually on-call sysadmin isn't ideal.
The application itself is a total of 3 pages, encompassing maybe 20 endpoints at the most, with about 100 daily active users. For the backend, some genius decided to build a massive Kubernetes stack with 74 unique services, which has been costing said company over $1K/month in infra costs alone. It took me literally weeks to get comfortable working on the backend, and so much stuff has broken that I have no idea how to fix it.
Not only that, but the company has never had more than 1 engineer working on it at a time (they're very small even though they've been around a bit). If there were such a thing as developer malpractice, I'd sue whoever built it.
In the cases where I've seen this, honestly I think the ones to blame were the stakeholders, for hiring very young people at the cheapest rate they could, and giving them full responsibility and the Senior Architecture Something Something title when those people don't have more than a couple years' experience and are just building whatever they read in a blog two weeks ago.
Just, wat.
Sounds like the architect was doing some resume driven development cause damn.
This has to be a crime.
Our architecture is extremely simple and boring — it would probably be more-or-less recognizable to someone from 2010: a single Rails MVC app, 95+% server-rendered HTML, really only a smattering of JavaScript. (Some past devs did some stuff with Redshift for certain data that was a bad call; we're in the process of ripping that out and going back to good old Postgres.)
Our users seem to like it though, and talk about how easy it is to get set up. Looking at the site, the interactions aren't all that different from what we would build if we were using a SPA. But we're just 2 developers at the moment, and we can move faster than much larger teams just because there's less stuff to contend with.
Frankly out of all the things that make our architecture simple and efficient, I would say server rendered HTML is by far the biggest one.
Some front-end frameworks are closing this gap, but I wouldn't necessarily say they're equally as simple. See https://macwright.com/2020/05/10/spa-fatigue.html
In other words, choose the right tool for the job.
I guess they wanted me to use lots of little components in an SPA, which I did in my day job, but it didn't seem necessary for the task...
I could implement a Twitter in 1 Python or Go file, hosted on 1 machine
granted its concurrent user capacity and traffic load capacity would be insufficient for actual Twitter. but all the basics would work, in the small
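In that spirit, here's a toy sketch of what the core of that 1-file Twitter could look like (Python, in-memory only, all names mine; a real single-file version would swap the dicts for SQLite and put an HTTP handler in front):

```python
from collections import defaultdict
from itertools import count

# In-memory state: fine for a single process, lost on restart.
tweets = {}                        # tweet_id -> (author, text)
timelines = defaultdict(list)      # author -> [tweet_id, ...]
following = defaultdict(set)       # user -> {followed users}
_next_id = count(1)                # monotonically increasing tweet ids

def post(author, text):
    tweet_id = next(_next_id)
    tweets[tweet_id] = (author, text)
    timelines[author].append(tweet_id)
    return tweet_id

def follow(user, target):
    following[user].add(target)

def home_timeline(user, limit=20):
    """Newest-first tweets from everyone the user follows, plus themselves."""
    ids = []
    for author in following[user] | {user}:
        ids.extend(timelines[author])
    return [tweets[i] for i in sorted(ids, reverse=True)[:limit]]
```

Posting, following, and timelines — all the basics work "in the small"; what's missing is exactly the concurrency and fan-out machinery that real Twitter exists to provide.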
We use one of these "aggressively simple" architectures too. At this point, I would quit my job instantaneously if I had to even look at k8s or whatever the cool kids are using these days.
I'm fine with complex architecture and would actually welcome someone choosing something complex but the issue is that we have perverse incentives at work to introduce stuff just to pad our resume.
Kubernetes was designed for companies deploying thousands of small APIs/applications where management is a burden. I've seen companies that deploy 3 APIs running Kubernetes and having issues...
RE the SQLAlchemy concern: you do need to decide where your transactions are going to be managed from and have a strict rule about not allowing functions to commit/rollback themselves. Personally I think SQLA is a great tool; it saves a lot of boilerplate code (and data modelling and migrations are a breeze).
But overall the sentiments in this article resonate with my experience.
From their job pages:
Our stack :
backend: Python 3 (+ mypy)
API layer: GraphQL
android frontend: Kotlin/Jetpack
iOS frontend: Swift/SwiftUI
web frontend: TypeScript/React
database: Postgres
infrastructure: GCP / Terraform
orchestration: Kubernetes
That is not simple by any stretch of the imagination. The engineering message should be: keep your architecture as simple as possible. And here are some ways to find that minimal and complete size-2 outfit foundation in your size-10 hoarder-track-suit eyesore.
Do we really need to be preached at with a warmed-over redo of "X cut it for me as a kid, so I really don't know why all the kids think their newfangled Y is better"? No, we don't.
If you have stateless, share-nothing events, your architecture should be simple. Should or could you have stateless share-nothing even if that's not what you have today? That's where we need to be weighing in.
Summary: less old guy whining/showing-off and more education. Thanks. From the Breakfast club kids.
So far that just describes any good science paper or math textbook; the difference is that he's writing about questions of great interest to HN, like in this case "how to build a web service", "how to do software version tracking" (https://danluu.com/monorepo/), "how to do statistics" (https://danluu.com/linear-hammer/), "why hardware development is hard" (https://danluu.com/why-hardware-development-is-hard/), "why everything is broken" (https://danluu.com/nothing-works/), and "how to hire talented people" (https://danluu.com/talent/). These are topics where there is an enormous amount of hot air out there on the web, but very little that is epistemologically justifiable.
Minimalism is fine. But there comes a point when there's so little, it is nothing. danluu.com is a bucket of sand facing an overbuilt cathedral.
body { font: 16px/1.6em sans-serif; max-width: 50em; margin: auto; }
Can even add it manually in the inspector if you want. I do wish I didn't have to do that though.
In almost all performance areas -- gaming, PCs, autos, etc -- there are usually whole publications dedicated to performing benchmarks and publishing those results.
Are there any publications or sites which implement a few basic applications against various new-this-season "full stacks" or whatnot, and document performance numbers and limit-thresholds on different hardware?
Likewise, there must be stress-test frameworks out there. Are there stress-test and scalability-test third-party services?
Even with his architecture, it sounds like they have an API service, a queue, and some worker processes. And they already have Kubernetes, which means they must be wrapping all of that in Docker. It seems like a no-brainer to me to at least separate the code for the API service from the workers so that they can scale independently. And depending on the kind of work the workers are doing, you might split those into a few separate codebases. Or not — I've had success on multiple projects where all jobs are handled by a set of workers with a massive `switch` statement on a `jobType` field.
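That "one worker pool, switch on jobType" pattern is about as simple as job dispatch gets. A sketch (Python; a dict of handlers plays the role of the big `switch`, and the handler names and payload fields are mine, purely for illustration):

```python
# One handler per job type; registering them in a dict is the
# Python equivalent of the big `switch` on `jobType`.
def send_email(payload):
    return f"emailed {payload['to']}"

def resize_image(payload):
    return f"resized {payload['path']} to {payload['width']}px"

HANDLERS = {
    "send_email": send_email,
    "resize_image": resize_image,
}

def handle_job(job):
    """Dispatch a dequeued job to its handler via its jobType field."""
    handler = HANDLERS.get(job["jobType"])
    if handler is None:
        # Surface unknown types loudly rather than dropping them.
        raise ValueError(f"unknown jobType: {job['jobType']}")
    return handler(job["payload"])
```

Adding a new job type is one function plus one dict entry — no new service, no new deployment unit.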
I think there is some middle ground between micro-services and monoliths where the vast majority of us live. And in our minds we're creating these straw-man arguments against architectures that rarely exist. Like a literal single app running on a single machine vs. a hundred independent micro-services stitched together with ad-hoc protocols. Micro-services vs. monoliths is actually a gradient where we rarely exist at either ludicrous extreme.
It's quite easy these days to deploy an app using AWS Lambda, DynamoDB, SNS, etc., all from a single CloudFormation template. Is that simple? In one sense I've abstracted away a lot of the operational work that comes with self-hosting, but now I've intertwined (Rich Hickey might say complected) myself with Amazon's ecosystem.
Also, is a document store like DynamoDB, MongoDB, etc., simpler than a relational database like Postgres? On the one hand, a document database's interface is very simple compared to the complexity of SQL. On the other, that simplicity is generally considered a necessary sacrifice for scale. If you don't need to scale, why make the sacrifice?
Also, there can be simple things that are better at scaling. Elixir is a very nice scripting language like Ruby or Python, but it also has much better performance scaling (comparable with NodeJS or Go).
What do people use instead of Graphene? Strawberry?
How often do you hit ‘that certain size’ when velocity starts to degrade anyway?
"X works well until it doesn't" is not exactly a compelling argument. That can be said of simple and complex architectures, or of just about anything at all.
So even if you’re a bit worried about scaling it, you can at least feel the problems are far away enough that you shouldn’t care until later.
I’ve been looking for real world performance.
What country could that be? That sounds challenging.
Normally I check the Internet Archive, but https://web.archive.org/web/*/https://danluu.com/simple-arch....
That links to the original on wave.com, dated March 9th this year.
And you're calling that simple?
I've worked on monolithic codebases, and the one thing none of them have ever been is simple. They have complex interdependencies (oh hey, like database transaction scopes); they have that 'one weird way of doing things' that affects every part of the system (like, 'everything has to be available over GraphQL')...