[1] https://github.com/cybertec-postgresql/zheap [2] https://techcommunity.microsoft.com/t5/azure-database-for-po...
I agree with your point. Postgres is starting to stick out compared to alternatives:
- MS SQL supports uni-temporal tables using system time.
- Snowflake has time travel which acts like temporal tables but with a limited retention window. Seems more like a restore mechanism.
- MariaDB has system-versioned tables (doesn't look like it's in MySQL).
- CockroachDB has uni-temporal support via system time, but only within the garbage-collection period. The docs indicate you don't want a long garbage-collection period, since all versions of a row are stored in a single range.
- Oracle seems to have the best temporal support with their flashback tech. But it's hard to read between the lines to figure out what it actually does.
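For readers who haven't touched any of these: the core mechanic behind system-time (uni-temporal) tables is the same everywhere, even though each vendor's SQL differs. Here is a minimal application-level sketch in Python/SQLite, with hypothetical table and column names; real implementations do this transparently in the engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_history (
        id         INTEGER,
        balance    INTEGER,
        valid_from INTEGER,              -- logical system time, inclusive
        valid_to   INTEGER DEFAULT NULL  -- NULL = current version
    )
""")

def update_balance(acct_id, new_balance, now):
    # Close out the current version, then insert the new one.
    conn.execute(
        "UPDATE account_history SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (now, acct_id))
    conn.execute(
        "INSERT INTO account_history (id, balance, valid_from) VALUES (?, ?, ?)",
        (acct_id, new_balance, now))

def balance_as_of(acct_id, t):
    # The temporal "AS OF" query: whichever version was live at time t.
    row = conn.execute(
        """SELECT balance FROM account_history
           WHERE id = ? AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to > ?)""",
        (acct_id, t, t)).fetchone()
    return row[0] if row else None

update_balance(1, 100, now=10)
update_balance(1, 250, now=20)
print(balance_as_of(1, 15))  # 100 (the historical version)
print(balance_as_of(1, 25))  # 250 (the current version)
```

The retention-window caveats above (Snowflake, CockroachDB) amount to how long those closed-out rows are kept before being garbage-collected.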
https://www.cybertec-postgresql.com/en/zheap-undo-logs-disca...
https://github.com/cybertec-postgresql/postgres/tree/zheap_u...
Not completely related, but his lectures on databases on YouTube are really good. Much better than the DB class I had at college.
A great way to learn more about the inner workings of databases, and entertaining too.
Another choice quote (from one of his lectures):
“There’s only two things I care about in life:
1. My wife
2. Databases
I don’t give a f#ck about anything else”
TimescaleDB looks very exciting: it's "just" a PG extension, and their compression work looks great. [0]
I'm also really loving ClickHouse, but I haven't deployed it to production yet (haven't had the need; I almost did for an Apache Arrow reading thing, but didn't end up using Arrow). They do some amazing things there, and the work they do is crazy impressive and fast. Reading their changelog, they power through things.
[0] https://docs.timescale.com/timescaledb/latest/how-to-guides/...
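Much of the compression win in time-series stores comes from delta-of-delta encoding of timestamps (the Gorilla-style trick that columnar time-series compression builds on): regularly spaced samples collapse to runs of zeros, which then compress to almost nothing. A toy sketch of the idea, not TimescaleDB's actual implementation (it assumes at least two samples):

```python
def delta_of_delta(timestamps):
    # First-order deltas, then deltas of those deltas.
    # Regularly spaced timestamps yield mostly zeros.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

def decode(encoded):
    # Lossless inverse: rebuild deltas, then rebuild timestamps.
    head, first_delta, rest = encoded[0], encoded[1], encoded[2:]
    deltas = [first_delta]
    for dd in rest:
        deltas.append(deltas[-1] + dd)
    out = [head]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1000, 1010, 1020, 1030, 1041]   # samples ~10s apart
print(delta_of_delta(ts))             # [1000, 10, 0, 0, 1]
print(decode(delta_of_delta(ts)) == ts)  # True
```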
The reality is that nothing is dominating. In 2021 there were more databases than ever, each addressing a different use case. Companies don't have just one EDW; they have dozens, even hundreds, of siloed data stores. Startups will start with one database for everything, then split out auth, user analytics, telemetry, etc.
There is no evidence of any consolidation in the market. And definitely not some mass trend towards PostgreSQL.
1. OtterTune doesn't sell PostgreSQL services; they sell a database optimization service that happens to support PostgreSQL (and other databases like MySQL)
2. PostgreSQL is definitely gaining market share, and fast; see the DB-Engines graph [1]. You can compare it to the Oracle trend [2] if you're not convinced.
[1] https://db-engines.com/en/ranking_trend/system/PostgreSQL
An ML program that automatically tunes your production database in real time. What could possibly go wrong?
Either way, nothing here suggests that PostgreSQL is in any way dominating.
No matter how much you tune or denormalize Postgres, you'll never get the full-text search performance Elasticsearch offers. Our best efforts on a 5-million-row table yielded 600ms query times vs. 30-60ms.
Similarly with Snowflake: you'd never expect Postgres to perform analytical queries at that scale.
I know graph databases and time-series DBs have similar performance tradeoffs.
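The full-text gap largely comes down to index structure: dedicated engines answer a query with posting-list intersections instead of scanning and matching rows. A toy inverted index showing the shape of the lookup (an illustration of the idea only, not how Elasticsearch or Postgres's GIN indexes are actually implemented):

```python
from collections import defaultdict

docs = {
    1: "postgres full text search",
    2: "elasticsearch inverted index",
    3: "full text search with an inverted index",
}

# Build the inverted index: term -> set of document ids (a posting list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # AND query: intersect posting lists, starting from the smallest.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    return set.intersection(*postings)

print(search("full", "text"))        # {1, 3}
print(search("inverted", "index"))   # {2, 3}
```

Real engines add tokenization, ranking, and compressed on-disk posting lists on top of this, which is where the engineering effort (and the latency gap) lives.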
I think the most interesting and challenging area is how to architect a system that uses many of these databases and keeps them eventually consistent within some bound.
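One common answer to keeping a primary store and a secondary (search index, analytics DB, etc.) eventually consistent is the transactional-outbox pattern: write the change and an outbox record together, and let a relay drain the outbox into the secondary, so staleness is bounded by relay lag. A minimal sketch with in-memory dicts standing in for the stores (all names hypothetical):

```python
primary, secondary, outbox = {}, {}, []

def write(key, value):
    # In a real system these two writes share one transaction,
    # so the outbox can never miss a committed change.
    primary[key] = value
    outbox.append((key, value))

def drain_outbox():
    # The relay: apply pending changes to the secondary store in order.
    while outbox:
        key, value = outbox.pop(0)
        secondary[key] = value

write("user:1", {"name": "Ada"})
write("user:1", {"name": "Ada Lovelace"})
print(primary == secondary)   # False: secondary lags until the relay runs
drain_outbox()
print(primary == secondary)   # True: converged
```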
ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. https://github.com/zombodb/zombodb#readme
But for small-to-medium datasets his advice to just stick with PostgreSQL is good: start with an easy solution that gives you everything you need (often by simply installing an extension). If you need more specialized software, THEN use it; don't start with an overcomplicated stack just because Elasticsearch or ClickHouse may be the state-of-the-art open-source solution to a specific problem.
1. Async replication tolerating data loss from slightly stale backup after a failover?
2. Sync replication tolerating downtime during manual failover?
3. Distributed consensus protocol for automated failover, high availability and no data loss, e.g. Viewstamped Replication, Paxos or Raft?
It seems like most managed-service databases such as Aurora, Timescale, etc. are doing option 3, while the open-source alternatives are otherwise still on options 1 and 2?
I was assuming that in both cases of manual failover the operator would need some way of physically shutting down the old primary, then starting it again only as a backup that doesn't reply to clients. Alternatively, the cluster would need to remain unavailable whenever any node is partitioned.
But none of this is really very practical when compared to a consensus protocol (or R/W quorums) and distributed database. I'm genuinely curious how people solve this with something like Postgres. Or is it perhaps something that isn't much worried about?
Well, unless each node has a complete copy of the data?
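For option 3, the property that makes R/W quorums (and quorum-based consensus) work is easy to check: with N replicas, any write quorum of size W and any read quorum of size R must intersect whenever R + W > N, so a read always touches at least one node that saw the latest acknowledged write. A brute-force check of both the safe and the unsafe configuration:

```python
import itertools

N = 5
nodes = set(range(N))

def quorums_always_overlap(w_size, r_size):
    # Exhaustively verify that every write quorum meets every read quorum.
    return all(
        set(w) & set(r)
        for w in itertools.combinations(nodes, w_size)
        for r in itertools.combinations(nodes, r_size)
    )

overlaps = quorums_always_overlap(3, 3)   # R + W = 6 > N = 5
overlaps2 = quorums_always_overlap(2, 2)  # R + W = 4 <= N: quorums can miss
print(overlaps)   # True
print(overlaps2)  # False
```

This is why a node with only a partial copy of the data can still participate: correctness comes from quorum intersection, not from every node holding everything.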
The critical importance of extensibility as a primary concern of successful DB products needs to be highlighted. Realities of the domain dictate that product X matures a few years after inception, at which point the application patterns may have shifted. (Remember map-reduce?) If you pay attention, for example, you'll note that the du jour darlings are scrambling to claim fitness for ML (a subset of big-data), and the new comers are claiming to be "designed for ML".
Smart VC money should be on extensible players ..
Dgraph uses GraphQL as its native query language.
Does anyone here have experience with it to share? Asking since it isn't mentioned in the article.
I can't say I can relate, but I do appreciate being this passionate about things!
Professional lives should be separate from personal ones, but please, indulge us with a story!
> Rockset joined in, saying its performance was better for real-time analytics than the other two.
So I went and read the linked Rockset comparison blog post, and while I get that it’s a marketing piece, it’s also so transparently desperate for any advantage over Druid and ClickHouse that their criteria are bizarre at best, and bordering on wildly incorrect at worst.
I’ve been burnt by commercial databases before, and I have a hard time justifying ever using one, especially considering the advent of open-source databases that have feature and performance parity (if not outright superiority) and can be self-hosted on K8s or easily bought as managed hosting.
I think we’ll get even more hosting options now that ClickHouse has its own backing company.
But I guess not much else happened to it other than PlanetScale.
--------
An architect demoed the failure of a shard and the automatic promotion of its backup shard to main, in production. They actually test their failure models. As I see it, sharding is not very hard. HA is not very hard given a reasonable SLA. But sharding with HA on a large setup that actually works is pretty hard.
Another thing that stuck in my mind was their high throughput-per-provisioned-hardware ratio. With not much hardware they were pulling 80k queries per second with room to spare.
Although I have to say, that's not much compared to GitHub, which pulls 1.2 million queries/sec on Vitess [0].
[0] https://github.blog/2021-09-27-partitioning-githubs-relation...
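For a sense of what the routing layer in such a setup does, here is a toy sketch of hash-based sharding with replica promotion on primary failure. All node names are hypothetical, and real systems like Vitess layer consensus, resharding, and health checking on top of this:

```python
import hashlib

# Each shard has a primary and a replica standing by for promotion.
SHARDS = {
    0: {"primary": "db0-a", "replica": "db0-b"},
    1: {"primary": "db1-a", "replica": "db1-b"},
}

def shard_for(key):
    # Stable hash of the key picks the shard.
    digest = hashlib.sha1(key.encode()).digest()
    return digest[0] % len(SHARDS)

def route(key, down=frozenset()):
    shard = SHARDS[shard_for(key)]
    if shard["primary"] in down:
        return shard["replica"]   # failover: promote the backup
    return shard["primary"]

key = "user:42"
print(route(key))                         # the shard's primary
print(route(key, down={route(key)}))      # its replica, after failover
```

The hard part the talk was really about is making that `down` set accurate and agreed-upon across the cluster, which is where failure testing earns its keep.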
Looking forward instead of backward, it would be great for databases to have some kind of live-patch/live-update feature, so that no downtime is needed at all if certain rules are obeyed (with an automatic check that they are). The same goes for operating systems, where we have parts of the technology and even some limited deployment, but none of it is the default as far as I know. This situation makes it quite a bit harder to develop and maintain systems without introducing extreme complexity. It does not look like we will have fewer bugs or fewer patches any time soon, so we should make updating as easy as possible, drastically reducing the need for maintenance windows without resorting to building clusters for everything.
It's great to see the current team is on the move again, as the original ParAccel architecture did not scale very well. There was an excellent talk on Redshift in Andy Pavlo's Vaccination Database Tech Talks, 2nd Dose. [0] It's by Ippokratis Pandis and worth a view. It covers a lot of the recent improvements, which are likely to disappoint the many critics who have counted Redshift out. (Prematurely in my opinion.)
Does anyone else feel like a caveman when modeling a many to many relationship in a normalized schema, and then querying via SQL?
I’m surprised graph DBs aren’t more popular for this reason alone. Maybe it’s a far-fetched dream, but perhaps a graph frontend could be slapped onto the Postgres backend.
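For context, the "caveman" pattern in question: a junction table plus two joins. A self-contained example in SQLite (the schema and names are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE groups (id INTEGER PRIMARY KEY, name TEXT);
    -- The junction table that models the many-to-many relation.
    CREATE TABLE memberships (user_id INTEGER, group_id INTEGER);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'alan');
    INSERT INTO groups VALUES (10, 'dba'), (20, 'ml');
    INSERT INTO memberships VALUES (1, 10), (1, 20), (2, 20);
""")

rows = conn.execute("""
    SELECT u.name, g.name
    FROM users u
    JOIN memberships m ON m.user_id = u.id
    JOIN groups g      ON g.id = m.group_id
    WHERE u.name = 'ada'
    ORDER BY g.name
""").fetchall()
print(rows)  # [('ada', 'dba'), ('ada', 'ml')]
```

A graph query language would phrase the same traversal as following an edge type, which is exactly the ergonomic gap being lamented.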
Graph databases will not overtake relational databases in 2030 by marketshare.
Bookmark this comment. Reach out to me in 2030. If I'm wrong, I will replace my official CMU photo with one of me wearing a shirt that says "Graph Databases Are #1". I will use that photo until I retire, get fired, or a former student stabs me.
For example: one of the knocks against RDBMSs is that their structure is supposedly "rigid." That's simply not the case in many RDBMSs, such as those using column storage. Adding columns in databases like ClickHouse is a trivial metadata operation. This means that many problems Neo4j solves can be addressed in a more general-purpose RDBMS, because you can add columns easily to track relationships. It's pretty easy to envision other improvements to access methods to make searches more efficient.
I don't mean to undercut in any way the innovation of graph databases. It's just that the relational model is (a) extremely general and (b) can be extended.
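As a concrete instance of that generality: a plain edges table plus a recursive CTE (standard SQL) already expresses transitive graph reachability, with no graph engine required. A sketch in SQLite with an illustrative edge set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('x','y');
""")

# Everything reachable from 'a' by following edges transitively.
# UNION (not UNION ALL) dedups visited nodes, so the recursion terminates.
reachable = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()
print([n for (n,) in reachable])  # ['a', 'b', 'c', 'd']
```

What dedicated graph databases add is not expressiveness so much as graph-aware storage and traversal-optimized access paths, which is where the performance debate below actually lives.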
As for the first question - I've tried using Neo4j and ArangoDB for relatively large-scale graph querying (1-2TB of data) and both couldn't hold a candle to Postgres or MySQL in terms of query performance for cost. Neo requires you to store most of your data in memory and Arango isn't great for cross-shard querying.
Unless there's some major new graph DB that comes out in the next few years I would still bet on relational being dominant in 2030.
They say that they scale well. I have not tried any graph DB for prod work yet.
Either way, that’s not happening.
My god do we need an atlas of database related Apache projects.
It's almost as bad as java web frameworks about ten years ago.
Everyone can do everything and it's hard to know what is better for what.
I get that projects can be donated to Apache from disparate sources, but my god it’s still a disaster.
I think more scalable systems will continue to gain market share. It will be interesting to see if PlanetScale, CockroachDB, or some other actually becomes a big player.