In particular you've designed a very complicated system: Operationally you need an etcd cluster and a tailetc cluster. Code-wise you now have to maintain your own transaction-aware caching layer on top of etcd (https://github.com/tailscale/tailetc/blob/main/tailetc.go). That's quite a brave task considering how many databases fail at Jepsen. Have you tried running Jepsen tests on tailetc yourself? You also mentioned a secondary index system which I assume is built on top of tailetc again? How does that interact with tailetc?
Considering that high-availability was not a requirement and that the main problem with the previous solution was performance ("writes went from nearly a second (sometimes worse!) to milliseconds") it looks like a simple server with SQLite + some indexes could have gotten you quite far.
We don't really get the full overview from a short blog post like this though so maybe it turns out to be a great solution for you. The code quality itself looks great and it seems that you have thought about all of the hard problems.
What do you mean by this part? tailetc is a client library for etcd.
Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config. (I previously made LiveJournal and ran its massively sharded HA MySQL setup)
The typical workflow is to do all of your "reads" out of the cached keyspace, attempt to apply etcd transactions, and (if needed) block until your keyspace has caught up such that you read your write -- or someone else's conflicting write.
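That loop -- read from a cache at some revision, attempt a conditional write, retry on conflict -- is the standard optimistic-concurrency shape etcd transactions enable. Here's a minimal in-memory sketch of the pattern (not tailetc's actual code; the `store`, `update`, and revision bookkeeping are all illustrative stand-ins):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict mirrors a failed etcd Txn: the key's revision moved
// under us between our cached read and the write attempt.
var ErrConflict = errors.New("revision conflict")

// store is a stand-in for etcd: each key carries a revision that
// bumps on every successful write.
type store struct {
	mu   sync.Mutex
	vals map[string]string
	revs map[string]int64
}

func newStore() *store {
	return &store{vals: map[string]string{}, revs: map[string]int64{}}
}

// get returns the value and the revision the caller read it at.
func (s *store) get(key string) (string, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.vals[key], s.revs[key]
}

// put succeeds only if the key is still at the revision the caller
// read -- the shape of an etcd Txn guarded by a revision compare.
func (s *store) put(key, val string, readRev int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.revs[key] != readRev {
		return ErrConflict
	}
	s.vals[key] = val
	s.revs[key] = readRev + 1
	return nil
}

// update retries the read-modify-write loop until no other writer
// has touched the key in between.
func update(s *store, key string, f func(string) string) {
	for {
		old, rev := s.get(key)
		if s.put(key, f(old), rev) == nil {
			return
		}
	}
}

func main() {
	s := newStore()
	update(s, "greeting", func(string) string { return "hello" })
	v, rev := s.get("greeting")
	fmt.Println(v, rev) // hello 1
}
```

Against real etcd the revision guard would be expressed with a `Txn` comparing the key's mod revision, but the retry loop around it looks the same.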
Oh. Since they have a full cache of the database I thought it was intended to be used as a separate set of servers layered in front of etcd to lessen the read load. But you're actually using it directly? Interesting. What's the impact on memory usage and scalability? Are you not worried that this will not scale over time since all clients need to have all the data?
What if you used one of the managed RDBMS services offered by the big cloud providers? BTW, if you don't mind sharing, where are you hosting the control plane?
Plus maybe replacing the sync.Mutex with a sync.RWMutex for optimum read performance in a seldom-write use case.
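For concreteness, here's what that swap looks like: a read-mostly map guarded by `sync.RWMutex`, so concurrent readers don't serialize behind each other (a generic sketch, not tailetc's actual cache):

```go
package main

import (
	"fmt"
	"sync"
)

// cache guards a read-mostly map with sync.RWMutex: many readers
// proceed concurrently; a writer takes the lock exclusively.
type cache struct {
	mu sync.RWMutex
	m  map[string]string
}

func newCache() *cache { return &cache{m: map[string]string{}} }

func (c *cache) Get(k string) (string, bool) {
	c.mu.RLock() // shared: does not block other readers
	defer c.mu.RUnlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *cache) Set(k, v string) {
	c.mu.Lock() // exclusive: waits for in-flight readers to drain
	defer c.mu.Unlock()
	c.m[k] = v
}

func main() {
	c := newCache()
	c.Set("a", "1")
	v, _ := c.Get("a")
	fmt.Println(v) // 1
}
```

Whether it actually wins depends on contention: under low contention a plain `sync.Mutex` is often just as fast, since `RWMutex` has slightly higher per-operation overhead.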
On the other hand again, I feel a bit weird criticizing Brad Fitzpatrick ;-) — so there might be other things at play I don't have a clue about...
If you are expecting to simply persist one instance of one application's state across different runs and failures, a database can be frustrating.
But if you want to manage your data across different versions of an app, different apps accessing the same data, or concurrent access, then a database will save you a lot of headaches.
The trick is knowing which one you want. Persistence is tempting, so a lot of people fool themselves into going that direction, and it can be pretty painful.
I like to say that rollback is the killer feature of SQL. A single request fails (e.g. unique violation), and the overall app keeps going, handling other requests. Your application code can be pretty bad, and you can still have a good service. That's why PHP was awesome despite being bad -- SQL made it good (except for all the security pitfalls of PHP, which the DB couldn't help with).
In the OP they spent two weeks designing and implementing transaction-safe indexes -- something that all major SQL RDBMSes (and even many NoSQL solutions) have out of the box.
Loss of efficiency? Come on, you were using a file before! :-)
Article makes me glad I'm using Django. Just set up a managed Postgres instance in AWS and be done with it. SQLite for testing locally. Just works, and very little engineering time spent on persistent storage.
Note: I do realize Brad is a very, very good engineer.
Having no dedicated database server or even database instance, being able to persist data to disk with almost no additional memory required, marginal amount of CPU and no heavy application dependencies can be considered very efficient depending on context.
Of course, if you start doing this on every change, many times a second, then it stops being efficient. But then there are ways to fix it without invoking Oracle or MongoDB or other beast.
When I worked on an algorithmic trading framework, the persistence layer was just two pointers in memory: one to the end of the persisted region, one to the end of the published region. Occasionally those pointers would be sent over to a dedicated CPU core -- the only core talking to the operating system -- which would append that data to a file and publish completion so that the other core could update the pointers.
The application would never read the file (the latency even to an SSD is such that it might as well be on the Moon); the file existed so we could retrace a trading session and bring the application back up from the event log if it failed mid-session.
As the data was laid out in order in the file, reading that entire "database" took no more than 1.5s, after which the application was ready to synchronize with the trading session again.
This was my main thought throughout reading it. So many things to consider and difficult issues to solve it seems they face a self-made database hell. Makes me appreciate the simplicity and stable performance of django orm + postgre.
First, they're conflating storage with transport. SQL databases are a storage and query system. They're intended to be slow but efficient, like a bodybuilder. You don't ask a bodybuilder to run the 500m dash.
Second, they had a 150MB dataset, and they moved to... a distributed decentralized key-value store? They went from the simplest thing imaginable to the most complicated thing imaginable. I guess SQL is just complex in a direct way, and etcd is complex in an indirect way. But the end results of both are drastically different. And doesn't etcd have a whole lot of functional limitations SQL databases don't? Not to mention its dependence on gRPC makes it a PITA to work with REST APIs. Consul has a much better general-purpose design, imo.
More of it just doesn't make sense. Is this a backend component? Client side? Server side? Why was it using JSON if resources mattered (you could have saved like 20% of that 150MB with something less bloated)? Why a single process? Why global locks? I really don't understand the implementation at all. It seems like they threw away a common-sense solution to make a weird toy.
I think we're pretty well aware of the pros and cons of all the options and between the team members designing this we have pretty good experience with all of them. But it's entirely possible we didn't communicate the design constraints well enough. (More: https://news.ycombinator.com/item?id=25769320)
Our data's tiny. We don't want to do anything to access it. It's nice just having it in memory always.
Architecturally, see https://news.ycombinator.com/item?id=25768146
JSON vs compressed JSON isn't the point: see https://news.ycombinator.com/item?id=25768771 and my reply to it.
I get that SQLite wouldn't work, but it also doesn't make sense to have one completely independent database per process. So I imagine you're using a shared database, at which point etcd starts to make more sense. It's just not as widely understood in production as SQL databases, and it has limitations you might reach in a few years.
SQL - I'm not sure what the problems are with SQL. But it is like a second language to me so maybe I experienced these problems long ago and have forgotten about them.
ORM - I never use an ORM, so I have no idea what the problems might be.
testing latency - I don't know what this refers to.
dependency - ditto
https://twitter.com/_rchase_/status/1334619345935355905
It’s just 2 files.
Sometimes it’s better to focus on getting the product working, and handle tech debt later.
The real problem with that project was of course trying to set up a microservices architecture where it wasn't necessary yet and nobody had the right level of experience and critical thinking to determine where to separate the services.
This tool needs to insert data in the middle of (pretty short) lists, using a pretty complicated algorithm to calculate the position to insert at. If I had used an RDBMS, I'd probably have to implement fractional indexes, or at least change the IDs of all the entries following the newly inserted one, and that would be a lot of code to write. This way, I just copy part of the old slice, insert the new item, copy the other part (which are very easy operations in Go), and then write the whole thing out to JSON.
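The "copy part of the old slice, insert the new item, copy the other part" operation described above is a few lines of Go (a generic sketch -- the element type and function name are illustrative, not the tool's actual code):

```go
package main

import "fmt"

// insertAt returns a new slice with v inserted at index i: copy the
// prefix, append the new item, copy the suffix.
func insertAt(s []string, i int, v string) []string {
	out := make([]string, 0, len(s)+1)
	out = append(out, s[:i]...)
	out = append(out, v)
	out = append(out, s[i:]...)
	return out
}

func main() {
	items := []string{"a", "b", "d"}
	fmt.Println(insertAt(items, 2, "c")) // [a b c d]
}
```

In an RDBMS the equivalent really would mean either renumbering every following row's position column or maintaining fractional/gapped indexes, so for short lists the in-memory version is hard to beat for simplicity.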
I kept it simple, stupid, and I'm very happy I went with that decision. Sometimes you don't need a database after all.
[0] Via mutex in your case. Have you thought about durability, though? That one's actually weirdly difficult to guarantee...
Strictly speaking, it's literally impossible to guarantee[0], so it's more a question of what kinds and degrees of problems are in- versus out-of-scope for being able to recover from.
0: What happens if I smash your hard drive with a hammer? Oh, you have multiple hard drives? That's fine, I have multiple hammers.
Is this a typo? 150MB is such a minuscule amount of data that you could do pretty much anything and be OK.
But writing out 150MB many times per second isn't super nice when both the 150MB and the number of writes per second are growing.
I am also going to go out on a limb and guess that this is all running in Kubernetes. Running etcd there is dead simple compared to running even something like Postgres.
Congrats on a well engineered solution that you can easily test on a dev machine. Running a DB in a docker container isn't difficult but it is just one more dev environment nuance that needs to be maintained.
At that point I thought "hmm, I guess I know databases." A few years later I attempted to store some slightly larger, more complicated data in MySQL and query it. My query was basically "join every record in this table against itself, returning only rows that satisfy some filter". It ran incredibly slowly. It turned out our lab secretary was an ex-IBM database engineer, and she said "did you try sorting the data first?" One call to strace showed that MySQL was doing a very inefficient full table scan for each row, but by inserting the data in sorted order, the query ran much faster. Uh, OK. I can't repeat the result, so I expect MySQL fixed it at some point.
She showed me the sorts of DBs "real professionals" designed: a third-normal-form menu ordering system for an early meal delivery website (wayyyyy ahead of its time: food.com). At that point I realized there was obviously something I didn't know about databases, in particular that there is an entire schema theory on how to structure knowledge to take advantage of the features databases have.
My next real experience with databases came when I was hired to help run Google's MySQL databases. Google's Ads DB was implemented as a collection of MySQL primaries with many local and remote replicas. It was a beast to run, required many trained engineers, and never used any truly clever techniques, since the database was sharded so nobody could really do any interesting joins.
I gained a ton of appreciation for MySQL's capabilities from that experience, but I can't say I really enjoy MySQL as a system. I like PostgreSQL much better; it feels like a grownup database.
What I can say is that all this experience, plus some recent work with ORMs, has led me to believe that while the SQL query model and RDBMSes are very powerful, you basically have to fully buy into the mental model and retain some serious engineering talent: folks who understand database index disk structures, multithreading, etc.
For everybody else, a simple single-machine on-disk key-value store with no schema is probably the best thing you can do.
* our data is tiny and fits in RAM
* our data changes often
* we want to eventually get to an HA setup (3-5 etcd nodes now, a handful of backend server instances later)
* we want to be able to do blue/green deploys of our backend control server
* we want tests to run incredibly quickly (our current 8 seconds for all tests is too slow)
* we don't want all engineers to have to install Docker or specific DB versions on their dev machines
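The last two constraints in that list point at a common pattern: put the store behind a narrow interface so tests use an in-memory double instead of a real database. A minimal sketch (assuming a key-value-shaped interface; `Store`, `memStore`, and `registerNode` are all hypothetical names, not Tailscale's actual code):

```go
package main

import "fmt"

// Store is the narrow interface the control server would code
// against; production backs it with a real database, tests with the
// in-memory map below.
type Store interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

// memStore is the test double: no Docker, no pinned DB version, no
// startup cost, so the whole suite stays in-process and fast.
type memStore struct{ m map[string]string }

func newMemStore() *memStore { return &memStore{m: map[string]string{}} }

func (s *memStore) Get(k string) (string, bool) { v, ok := s.m[k]; return v, ok }
func (s *memStore) Put(k, v string)             { s.m[k] = v }

// registerNode is example application code: it only sees the
// interface, so it never knows which backing store it got.
func registerNode(s Store, id, addr string) {
	s.Put("node/"+id, addr)
}

func main() {
	s := newMemStore()
	registerNode(s, "n1", "100.64.0.1")
	addr, _ := s.Get("node/n1")
	fmt.Println(addr) // 100.64.0.1
}
```

With this shape, "a hundred tests in 1.5s" is the expected outcome, since each test just allocates a map.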
Curiously, though, it's a pattern I've seen twice in the last 12 months: there was that guide on the good bits of AWS that also recommended starting with a single host with everything running on it.
Maybe we should all move that host back under our desks and really be back to basics!
I had no idea companies of this size had engineers with that much free time on their hands.
I don't even mind senior devs putting together things like this as the cornerstone of the company, provided that at any given point in time there are two people who know how it works and can work on it, and that sufficient time was spent looking at existing solutions before making that call. It should be made with the full expectation that the first paragraph is inevitable.
Specifically, in this case: Without any actual data (# of reads, # of writes, size of writes, size of data, read patterns, consistency requirements) it is not possible to judge whether going custom on such a system was merited or not. I would find it VERY difficult to come to the conclusion that this use case couldn't be solved with very common tooling such as Spark and/or NATS Streaming. "Provided the entire dataset fits in memory" is a very large design liberty when designing such a solution and doesn't scream "scalability" or n+1-node write consistency to me. I say this acknowledging full well that etcd is an unbelievably well-written piece of software with durability and speed beyond its years.
Keeping my eyes open for that post-Series-A post-mortem post.
Uh, you compressed it first, right? Because CPUs can compress data faster than disk I/O.
But the bigger problem was the curve. Doing something O(N*M) where N (file size) and M (writes per second) were both growing was not a winning strategy, compression or not.
The only feature it's missing is compressing the data that is not actively in use; with that, there's really not much of a downside.
People seem to keep using poorly-designed ORMs or are stuck with some strange anti-ORM ideology.
Modern ORMs are fast, efficient, and very productive. If you're working with relational databases then you're using an ORM. It's a question of whether you use something prebuilt or write it yourself (since those objects have to be mapped to the database somehow). 99% of the time, ORMs generate perfectly fine SQL (if not exactly what you'd type anyway) while handling connections, security, mapping, batching, transactions, and other issues inherent in database calls.
The 1% of the time you need something more complex, you can always switch to manual SQL (and ORMs will even let you run that SQL while handling the rest as usual). The overall advantages massively outweigh any negatives, if they even apply to your project.
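The "those objects have to be mapped to the database somehow" point is easy to make concrete: the kernel of any ORM is turning a struct into SQL and parameters. A hand-rolled sketch using reflection (the `db` tag convention and `insertSQL` helper are illustrative, not any particular library's API):

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// insertSQL builds a parameterized INSERT for a flat struct, reading
// column names from `db` field tags -- the core of what an ORM does
// under the hood, minus connections, caching, and dialect handling.
func insertSQL(table string, v interface{}) (string, []interface{}) {
	t := reflect.TypeOf(v)
	val := reflect.ValueOf(v)
	var cols, marks []string
	var args []interface{}
	for i := 0; i < t.NumField(); i++ {
		col := t.Field(i).Tag.Get("db")
		if col == "" {
			col = strings.ToLower(t.Field(i).Name)
		}
		cols = append(cols, col)
		marks = append(marks, "?")
		args = append(args, val.Field(i).Interface())
	}
	q := fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s)",
		table, strings.Join(cols, ", "), strings.Join(marks, ", "))
	return q, args
}

type User struct {
	ID   int    `db:"id"`
	Name string `db:"name"`
}

func main() {
	q, args := insertSQL("users", User{ID: 1, Name: "alice"})
	fmt.Println(q, args)
	// INSERT INTO users (id, name) VALUES (?, ?) [1 alice]
}
```

Twenty lines gets you this far; relationships, change tracking, batching, and dialect quirks are where "the one you write yourself" starts accumulating the bugs the comment above alludes to.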
The right tool is a wrapper / DSL over SQL, which allows you to interact with the database in a predictable and efficient way, while not writing placeholder-ridden SQL by hand. Composable, typesafe, readable.
ORMs do fine in small applications without performance requirements. The further you go from that, the less adequate an ORM becomes, and the more you have to sidestep and even fight it, in my experience.
The only reason objects are the "wrong" abstraction is because they don't match relational models exactly. That impedance mismatch is the entire reason for the object-to-relational mapping, otherwise you can use things like document-stores and just serialize your objects directly.
An ORM that worked on the principle of "insert query text of any complexity, receive objects back" as the primary use case, not as a "nonstandard and non-idiomatic use case", would be the only way to ease the concerns of DBAs who code, like me.
It's the same pitfall as API clients. Why would I take the time to learn an API like it's an SDK, along with the pains of trying to shunt OpenAPI's libraries into my application without requiring a Composer build step (further complicating deployment), when I can write five methods before lunch to do the bits I need as REST queries, and deployment of my PHP app remains as simple as `git pull production` on the NFS share all the workers read from?
The benefit of compile-time-validated symbols is moot in the days of test-driven development, so those benefits can be had without creating build complexity or making competent engineers re-learn something they already know, re-abstracted in a way that is almost always harder to pick up for somebody who understands the low level than for a new dev.
Like I said, you're either using an existing ORM or writing your own every time, and the one you write probably won't be very good, as shown by the numerous security and performance bugs that are constantly found. Also, abstractions are useful. All of software development is built on abstractions, and they don't suddenly become useless when it comes to databases.
I also don't see what this has to do with API clients or build steps, but there are good and bad examples of those too.
What "modern ORM" are you using?
Sure I don't write PHP, but the ORMs I've used in Java/Kotlin, Ruby, JavaScript and lately Rust and Crystal have all been more than fine with me writing raw SQL.
The author seems to be using Go, which honestly could use work in that area. gorm is the biggest / most popular ORM out there, but it looks like a one-person project, the author seems well worn-out already, and it kinda falls apart when you work with a data model more than one level deep.
Plus broadly speaking, there seems to be a bit of an aversion to using libraries in the Go community.
Which ones do you have experience with?
In our case, the control plane data store really should be as boring as possible. It was a real stretch using anything other than MySQL. We tried to lay out the arguments in the post, but the most compelling was that we had lots of unit tests that spun up a DB and shut it down quickly. Maybe a hundred tests whose total execution time was 1.5s. The "boring" options made that surprisingly difficult.
(Tailscaler and blog post co-author)
Of course the benefit of what you did, even if I wouldn't have done it, is that you're _not_ using a different system in dev vs prod.
> Never underestimate how long your “temporary” hack will stay in production!
Is this actually easier than using SQLite?
I imagine it's like if you were trying to be an Olympic marathon runner: you'd study things like humidity and shoes and arm motion deeply, and he kind of does that (function naming, nodes per block, comments, test coverage, et cetera). Even if you don't agree with him, read it and play "devil's advocate" as you suggest, and you will be forced to think about details that will make you a better software engineer.
Probably not, because Tailscale is a decent product; but this post did not change my view in a good way.
> Through this process we would do major reorganizations of our SQL data model every week, which required an astonishing amount of typing. SQL is widely used, durable, effective, and requires an annoying amount of glue to bring into just about any programming language
and
> So we invested what probably amounts to two or three weeks of engineering time into designing in-memory indexes that are transactionally consistent
Sounds to me like someone has learned a lot on the job. Good for him, but it looks exactly like what I said before.
No. Not unless your data is relational. This is a common problem: relational databases have a lot of overhead. They are worth it when dealing with relational data, not so much with non-relational data.