I had the misfortune to use MongoDB at a previous job. The replication protocol wasn't atomic. You would find partial records that were never fixed in replicas. They claimed they fixed that in several releases, but never did. The right answer turned out to be to abandon MongoDB.
> Did any of you actually read the article? We are passing the Jepsen test suite and it was back in 2017 already. So, no, MongoDB is not losing anything if you know what you are doing.
https://twitter.com/MBeugnet/status/1253622755049734150?s=20
Can you imagine saying the phrase "if you know what you are doing," in public, to your users, as a DevRel person? Unbelievable.
- The system warns about unsafe usage at either compile time or runtime, and you ignore it at your peril.
- The system does not warn, but official documentation is consistently verbose about what is required for safety.
- Official documentation isn’t consistently helpful and can be downright dangerous, but the community picks up the slack.
- The company gaslights the community into believing it is possible for a non-core-team member to “know what they are doing” from one of the above levels when Jepsen provides written evidence that this is not true.
I’m fine with things that are the third level from the top. I like to live dangerously. But I don’t think anyone can look at that last level and say “people are giving informed consent to this.”
However I can _quite easily_ see how a non-native English speaker could use the phrase “if you know what you are doing” to mean “if you are careful”.
I'm much more concretely worried by a software design for which the authors (not hostile critics) consider "if you know what you are doing" an acceptable safety and quality standard for data integrity.
I imagine things are better now.
I pretty much refuse to deploy a new instance of it now, I've been burned too often.
As an intern at Shopify, I got an email from MongoDB asking us to switch. Shopify was 10 years old at the time. Several coworkers would also receive similar emails two years later (and some in between, of course).
I have a shirt from MemSQL that says "Friends don't let friends NoSQL" and I wear it proudly.
You’d be astounded how common it is at so-called “enterprise” startups. It blew my mind.
A lot of people simply never went through the LAMP stack days and have little/no experience with real databases like Postgres (or even MySQL). It’s disheartening.
But I'd think MongoDB the company's increasing revenue isn't totally related to the quality of MongoDB the database. In fact a lot of their products seem to be targeting the "I don't want to learn how to set it up and understand indexes" crowd.
For situations where you don't know the schema, or where schemas differ per record, Mongo is a great place to dump data.
It's also for data where you care about speed and don't care about losing some of it. Think sending back a game screen when the client moves and requires a redraw: depending on how fast the screen is changing, dropping a frame isn't the biggest deal.
Reporting was a little bit more difficult but somehow rewarding.
Are you sure?
"""
Curiously, MongoDB omitted any mention of these findings in their MongoDB and Jepsen page. Instead, that page discusses only passing results, makes no mention of read or write concern, buries the actual report in a footnote, and goes on to claim:
> MongoDB offers among the strongest data consistency, correctness, and safety guarantees of any database available today.
We encourage MongoDB to report Jepsen findings in context: while MongoDB did appear to offer per-document linearizability and causal consistency with the strongest settings, it also failed to offer those properties in most configurations.
"""
This is a really professional way to tell someone to stop their nonsense.
MongoDB explains that pretty well: https://www.mongodb.com/faq and https://docs.mongodb.com/manual/core/causal-consistency-read...
Postgres most certainly does fsync by default.
It's true, you can disable it, but there is a big warning about "may corrupt your database" in the config file.
Whatever failings MySQL or PostgreSQL may or may not have are not important at all here.
>>> I have to admit raising an eyebrow when I saw that web page. In that report, MongoDB lost data and violated causal by default. Somehow that became "among the strongest data consistency, correctness, and safety guarantees of any database available today"! <<<
It's not wrong, just misleading. Seems overblown given that most practitioners know how to read this kind of marketing speak.
So basically, whatever MongoDB was doing 10 years ago, they are continuing to do. They have not changed at all. Just a day or two ago there were a few people defending Mongo along the lines of: sure, in its early years Mongo wasn't the greatest, but it is now, and people should stop being hung up on the past.
The reason why people lost their trust with mongo wasn't technical, it was this.
* Mongo: I like things easy, even if easy is dangerous. I probably write Javascript exclusively
* MySQL: I don't like to rock the boat, and MySQL is available everywhere
* PostgreSQL: I'm not afraid of the command line
* H2: My company can't afford a database admin, so I embedded the database in our application (I have actually done this)
* SQLite: I'm either using SQLite as my app's file format, writing a smartphone app, or about to realize the difference between load-in-test and load-in-production
* RabbitMQ: I don't know what a database is
* Redis: I got tired of optimizing SQL queries
* Oracle: I'm being paid to sell you Oracle
Did I miss something huge?
Arguably the world's most popular database is Microsoft Excel.
If a customer's API was down, the event would go back on the queue with a header saying to retry it after some time. You can do some sort of incantation to specifically retrieve messages with a suitable header value, to find messages which are ready to retry. We used exponential backoff, capped at one day, because the API might be down for a week.
I didn't think of RabbitMQ as a database when I started that work, but it looked a lot like it by the time I finished.
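The retry-delay scheme described above (exponential backoff capped at one day) can be sketched in a few lines. The 30-second base interval here is an assumption for illustration; the original comment doesn't specify one:

```python
def retry_delay(attempt, base=30, cap=86400):
    """Delay in seconds before the next retry: base, 2*base, 4*base, ...
    capped at one day (86400s), since the API might be down for a week."""
    return min(base * (2 ** attempt), cap)

print(retry_delay(0))   # 30
print(retry_delay(3))   # 240
print(retry_delay(20))  # 86400 -- capped at one day
```

The message's header would carry the attempt count (or the computed retry-at timestamp), and the consumer skips messages whose retry time hasn't arrived yet.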
But also no, RabbitMQ and Kafka and the like are clearly message buses, and though they might technically qualify as a DB, it would be a poor descriptor.
It used to be that bargain basement shared-hosting providers would only give you a LAMP stack, so it was MySQL or nothing. But if you're on RDS, Postgres every time for my money.
I'd probably choose Postgres over MySQL for a new project just to have the improved JSON support, but there's upsides to MySQL too:
- Per-thread vs per-process connection handling
- Ease of getting replication running
- Ability to use alternate engines such as MyRocks
Oracle is great if and only if you have a use case that fits their strengths, you have an Oracle-specific DBA, and you do not care about the cost. I have been on teams where we met those criteria, and I genuinely had no complaints within that context.
Every time I need to work with an Oracle DB it costs me weeks of wasted time.
For a specific example, I was migrating a magazine customer to a new platform, and all of the Oracle dumps and reads would silently truncate long textfields... The "Oracle experts" couldn't figure it out, and I had to try 5 different tools before finally finding one that let me read the entire field (it was some flavor of JDBC or something). To me, that's bonkers behavior, and is just one of the reasons I've sworn them off as anything other than con artists.
I gotta say, as much as I hate it with a passion, and as often as it breaks for seemingly silly reasons (so many deadlocks), it's at least tolerable (even if I feel like Postgres is better by just about every metric).
I'm familiar with the variant, "InfoSec won't let us deploy a DB on the same host".
sqlite> create table foo (n int);
sqlite> insert into foo (n) values ('dave');
sqlite> select count(*) from foo where n = 'dave';
1

I can tell you this emphatically, as I spent 6 months trying to eke out performance with MySQL (5.6). PostgreSQL (9.4) handled the load much better, without me having to change memory allocators or do any kind of aggressive tuning of the OS.
MySQL has some kind of mutex lock that stalls all threads; it's not noticeable until you have 48 cores, 32 databases, and completely unconstrained I/O.
EDIT: it was PG 9.4 not 9.5
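The SQLite session a few comments up is easy to reproduce from Python's standard library. It shows SQLite's type affinity quietly storing a string in an INT column rather than rejecting it:

```python
import sqlite3

# Reproducing the sqlite> session above: a string inserted into an
# INT column is accepted, stored as text, and matched as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (n INT)")
conn.execute("INSERT INTO foo (n) VALUES ('dave')")

count, = conn.execute("SELECT count(*) FROM foo WHERE n = 'dave'").fetchone()
print(count)    # 1 -- the string went in without complaint

storage, = conn.execute("SELECT typeof(n) FROM foo").fetchone()
print(storage)  # text -- stored as text despite the INT column
```

(Recent SQLite versions offer STRICT tables that reject this, but the default behavior is what the session shows.)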
Logical replication or synchronous multimaster replication may meet your needs.
Almost none of this is remotely accurate, e.g. RabbitMQ isn't even a database.
It may be a good idea to take a break from the computer and find something less stressful to do.
We use it for a very specific use case and it's been perfect for us when we need raw speed over everything. Data loss is tolerable.
Edit: never mind, I think the other URL - http://jepsen.io/analyses/mongodb-4.2.6 - deserves a more technical thread, so will invite aphyr to repost it instead. It had a thread already (https://news.ycombinator.com/item?id=23191439) but despite getting a lot of upvotes, failed to make the front page (http://hnrankings.info/23191439/). I have no idea why—there were no moderation or other penalties on it. Sometimes HN's software produces weird effects as the firehose of content tries to make it through the tiny aperture of the frontpage.
I'd pay to watch Kyle screaming at people in the MongoDB offices, not that he screams or anything. Just a spectacular mental image: "IT'S NOT ATOMIC! IT COULDN'T SERIALIZE A DOG'S DINNER!"
The stock market wants to see the product as a competitor with Oracle, so demands all the certifications that say so. MongoDB marketing wants to be able to collect money as if the product were competitive. Many of the customers have management that would be embarrassed to spend that kind of money on a database that is not. And, ultimately, many of the applications do have durability requirements for some of the data.
So, MongoDB's engineers are pulled in one direction by actual (paying) users, and the opposite direction by the money people. It's not a good place to be. They have very competent engineers, but they have set themselves a problem that might not be solvable under their constraints, and that they might not be able to prove they have solved, if they did. Time spent on it does not address what most customers want to see progress on.
The syntax is very nice; I honestly think a lot of its early success came from ease of use.
Also, this isn't 2011. MongoDB is not a competitor to Oracle, and never really has been to anyone who knew that a document DB was not usable as a SQL one. Its real competitors are other SQL databases, e.g. Snowflake and Redshift.
It is possible there are still potential users not buying until they get that story. MDB wants those users.
People have told me that they have since changed, but the evidence is overwhelmingly and repeatedly against them.
They seem to have been successful on marketing alone. Or people care more about speed and ease of use than durability, and my assumptions about what people want in a database are just wrong.
I think it depends. One could say the same about Redis, but it's wildly successful and people love it.
The difference is now they are advertised. Redis makes no claims to be anything other than what it is - a fast in-memory database that has some persistence capability but isn't meant to be a long-term data store. MongoDB, on the other hand, made (and continues to make) claims about being comparable in atomicity and durability to traditional SQL databases (but magically much faster!) that haven't withstood scrutiny.
Keep in mind, too, that most data ain't worth much. It's one thing to entrust data of low value in MongoDB; another to store mission-critical data in it. I would look askew at leadership who didn't ask hard questions about storing data worth millions or billions of dollars in MongoDB without frequent snapshots -- and even then, the value mustn't be contingent on the 100% accuracy of said data.
It's easier to reason about systems if there are fewer things that require durability guarantees; ideally you want to be able to draw data flows that look like a tree instead of a graph.
I find that Redis fits great because it's perfect for a whole bunch of different temporal shared state needs, everything from sessions to partial results. I've also deployed things like Ehcache, MongoDB, and Memcached to fit these needs and found other tools such as Kafka or RabbitMQ to be great "glue".
Having the root of your important data be something "boring" like Postgres or MySQL (or even Oracle!) is just good risk management to me. I wouldn't want to trust Redis or MongoDB for important data because it adds to the things I have to worry about. It's "keeping your eggs in one basket" while making sure that basket is really well looked after.
If the service had lasted longer, scaled bigger, and the business it supported had been more successful, we might have ended up with a now-classic MongoDB to pg migration. That was always an acceptable outcome, and it would have not invalidated going with Mongo at the start.
I assume that you mean write once data. If you mean write only you might as well use /dev/null.
- they don’t know why, it was just the one they learned/heard about first
- there is a lot of tooling for it
A lot of them even knew about the limitations of MongoDB, but they still chose it. We concluded that other databases need to start prioritising usability; something few developer tools care about.
[1] https://supabase.io
I think 90% of the Mongo installs I've been exposed to were set up by people that were tired of fighting with Hibernate configurations and schema migrations.
It's also popular among people whose definition of "legacy software" is "that app I stopped working on after three months because I have something shiny and new."
But, if you need a traditional ACID database, the flexibility comes with punch-in-the-groin technical debt.
I absolutely agree it's been used by people who just don't want to write SQL queries, or used as a text search engine in place of something more appropriate like ElasticSearch, but to mock successful projects that were based on it seems silly. It reminds me of interviewing candidates at a startup that primarily used PHP/MySQL. Most of them openly laughed and called it all horrible. I voted "no" on them, and sometimes injected a somewhat toxic "ah, you're right - we should close up shop. Someone call Facebook - tell them their tech stack is horrible - shut it all down!".
You can learn a lot about a developer by asking "What do you think about Mongo, JavaScript, or PHP", and if their response isn't a shrug, they're probably more concerned with what editor is correct than if the product they're building is useful. It's an exceptional filter to reject zealots and find pragmatists.
All that said, MariaDB with MyRocks is _awesome_, but certainly not with the default settings :)
Sure, if they’re being rude about it. A developer saying that it will not fit the use case, or talking about spending a month of their time fixing a production issue caused by MongoDB, will definitely not get a “no” from me. I’m not hiring subservient people; I’m hiring people who can think for themselves and choose the right tool for the job, which Mongo rarely is.
It's a shame that Rethink did so many things right and failed as a company while Mongo continues to do almost everything wrong as a company and still gets business.
This seems to be more the rule than the exception, doesn't it?
It's not even that hard to come up with explanations for this, the main one certainly being that popularity depends essentially on simplicity.
And simplicity might not even be as economically inept as we would like it to be. Indeed, since only a small minority of all the systems that are designed reach production and stay there for long, it can make sense to use the quickest piece of junk available, at least until it's proven that it will stick.
Easy access to changelogs should be a standard feature in all databases. Event-driven systems aren't rare: the data store needs to be able to tell interested parties that the underlying data has changed.
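As a toy sketch of that idea: a store that notifies subscribers on every write. This is illustrative only; real databases expose the same capability as replication logs, change streams, or triggers:

```python
# Minimal "changelog" sketch: interested parties subscribe a callback
# and are told about every write to the store.
class ObservableStore:
    def __init__(self):
        self._data = {}
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def put(self, key, value):
        self._data[key] = value
        for cb in self._subscribers:
            cb(key, value)  # notify each subscriber of the change

events = []
store = ObservableStore()
store.subscribe(lambda k, v: events.append((k, v)))
store.put("user:1", {"name": "ada"})
print(events)  # [('user:1', {'name': 'ada'})]
```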
"MongoDB’s default level of write concern was (and remains) acknowledgement by a single node, which means MongoDB may lose data by default.
...Similarly, MongoDB’s default level of read concern allows aborted reads: readers can observe state that is not fully committed, and could be discarded in the future. As the read isolation consistency docs note, “Read uncommitted is the default isolation level”.
We found that due to these weak defaults, MongoDB’s causal sessions did not preserve causal consistency by default: users needed to specify both write and read concern majority (or higher) to actually get causal consistency. MongoDB closed the issue, saying it was working as designed"
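Per the quoted report, the stronger guarantees are opt-in: clients must explicitly request majority write and read concern. As a small server-free illustration (parsing only; the host and database names are placeholders), these are the connection-string options a client would need to set:

```python
from urllib.parse import urlparse, parse_qs

# Opting into the stronger settings the report describes, via standard
# MongoDB URI options. Host and database names are placeholders.
uri = "mongodb://db.example.com/app?w=majority&readConcernLevel=majority"
opts = parse_qs(urlparse(uri).query)

print(opts["w"][0])                 # majority
print(opts["readConcernLevel"][0])  # majority
```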
What do I use in this situation:
1) I need to store 100,000,000+ json files in a database
2) query the data in these json files
3) json files come from thousands upon thousands of different sources, each with their own drastically different "schema"
4) constantly adding more json files from constantly new sources
5) no time to figure out the schema prior to adding into the database
6) don't care if a json file is lost once in awhile
7) only 1 table, no relational tables needed
8) easy replication and sharding across servers sought after
9) don't actually require json, so long as data can be easily mapped from json to database format and back
10) can self host, no cloud only lock-in
Recommendations?
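A common answer to requirements like these is Postgres with a JSONB column and GIN indexes. As a self-contained sketch of the same dump-then-query pattern, here it is with SQLite's JSON functions (illustrative only; at 100,000,000+ documents with sharding requirements you would want Postgres or a dedicated store, not SQLite):

```python
import sqlite3

# One table, one JSON column, no upfront schema: documents with
# drastically different shapes coexist, and are queried by path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [('{"source": "sensor-a", "temp": 21.5}',),
     ('{"source": "sensor-b", "humidity": 40}',)],  # different "schemas"
)

rows = conn.execute(
    "SELECT json_extract(body, '$.temp') FROM docs "
    "WHERE json_extract(body, '$.source') = 'sensor-a'"
).fetchall()
print(rows)  # [(21.5,)]
```

In Postgres the equivalent would be a `jsonb` column with `->>` path operators and a GIN index to make those lookups fast.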
Depends on what your queries look like, I guess.
Ironically, because Mongo was such a pain to work with, I once dumped the data from it into ES to get the better API, usability, and Kibana.
That sounds like a valid redress, or am I missing something?
Basically, there are a large number of pitfalls that it's very easy to fall into unless you have an encyclopaedic knowledge of the documentation, and you need to ignore some of the words that are used (like "transaction" or "ACID") because they carry connotations that either do not apply or only apply if you do extra work to make it so.
In Mongo's defense, the defaults are similar to what you would likely have with a replicated MySQL/Postgres cluster (a single node accepting writes, with slaves replicating from there; no concept of write concern). My assumption here is that he means the primary dies before the writes have replicated to the secondaries; that is exactly how master-slave replication fails too. Maybe there are systems folks can use to get write-concern semantics in those databases, but in the companies I've worked for we didn't have them, and we definitely didn't have automated failovers.
Is the argument that Mongo’s documentation isn’t clear?
I'm glad my gut instinct was correct and that it really wasn't worth the hype. It reminds me of Ruby on Rails.
Regardless of technical acumen, I believe RoR doesn't deserve to be compared to Mongo for one reason: the RoR developers never tried to gaslight their users into thinking they're the reason everything broke; they never said only "if you know what you're doing" can you avoid these hidden pitfalls.
If you set w: majority and read concern linearizable/snapshot on the collection, the client, and transactions, and assuming you accept snapshot isolation, how bad are those remaining cases in reality, and how do these issues compare to other databases? The final "read your future writes" error looks quite scary and does not seem to be caused by configuration error; same with "duplicate effects".
- Dwight Merriman, former CEO, and "one of the original authors of MongoDB" [1]
A word to the wise suffices. Sometimes the word in question is implied by other words.
For those who get this oblique post, note that throwing the above bon mot into an interview session for a "distributed systems engineer" and asking for an opinion is an excellent way to differentiate between Peter Principle and Principal Engineer.
[1]: https://web.archive.org/web/20100903213540/http://blog.mongo...
[1] https://community.ui.com/questions/MongoDB-corrupt-after-eve...
https://jepsen.io/analyses/mongodb-4.2.6
... and the corresponding HN thread here:
Data point: I have been running my production system (a fairly complex SaaS) on RethinkDB for the last 4 years.
From my point of view, RethinkDB is not regularly developed and improved. There is progress, but it's slow. Which is a pity, because it's a really good database, and one that tries really hard to be correct above all else.
The only other correct distributed database with strict serializable guarantees that I know of is FoundationDB, which is nowhere near as easy to use as RethinkDB (but it's somewhat easier with their document layer, which pretends to be MongoDB, just done right).
https://news.ycombinator.com/item?id=23253870
(not Mongo obviously)
To repeat my (non)answer:
There is no way to recommend a NoSQL database without knowing what you need it for, because NoSQL databases are highly specialized systems. If you need a general-purpose database, use an SQL one.
It's kind of a weird question, now that I think about it. Why would anyone seek out a database based on what it doesn't have?
If you're starting from just "I need to store some data" I'd look to e.g. Riak or Cassandra before looking to an SQL database.
And the recent change to a restrictive license is worrisome as well. I have been thinking of forking 3.4 and making it "true" open source again, with awesome performance. (If any C++ devs want to help out, reach out to me! username @gmail.com)
Mongo should never be a first choice, but a last choice for edge cases.
Yes, that's the thing, it's just a field type. It's not really that different from dumping your JSON in a TEXT column. MongoDB is fun because it's truly JSON - BSON - so you don't have to run migrations, you can store complex documents, and you have a more object-oriented way of storing your data than SQL.
It's a nice goal but there's likely not much of a commercial market for it, if that's your roadmap.
Please do; someone needs to take that first step, and then many more could potentially contribute.
This corruption is brought on by the stock market.
Have a look also at Shopify. They go and tack on 2% fees when customers use Google Pay or Apple Pay to check out. They recently announced that FB would be pulling ecom sales into its app, and yet Shopify plans to charge 2% on top of FB's fees. That's what I could gather, despite the pricing being rather opaque.
Is this a step forward or backwards? Charging 2% / transaction for modern Internet protocols running on cheap hardware across a public network?
</rant>
https://hackingdistributed.com/2013/01/29/mongo-ft/
MongoDB: Broken By Design
And most of those listed in the blog were fixed many years before 2013.
Well, the warrior has lower upkeep costs. Keep that in mind.
The only thing similar about the two is that they both store data and have the letter D in their name. Otherwise they are completely different: Cassandra is a BigTable-style database and MongoDB a document one.
I hope this is a joke.
Really? Which job do you believe needs a "maybe store some of this data, sometimes" kind of database?
For example, climate data gathered from hundreds of thousands of devices every minute can very much survive some data loss. Or some astronomical observation data.
I wouldn't choose mongoDB for it, though.