For OLAP use cases with real-time data ingestion requirements, an object-storage-only approach also leads to write amplification. Therefore, I don't think that architectures like Apache Pinot, Apache Paimon, and Apache Druid are going anywhere.
Another problem with "open table formats" like Iceberg, Hudi, and Delta Lake is their slow innovation speed.
I've recently argued about this at greater length here: https://engineeringideas.substack.com/p/the-future-of-olap-t...
As ever, at FANGMA-whatever scale/use cases, yeah I’d agree with you. But the majority of cases are not FANGMA-whatever scale/use cases.
Basically, it’s good enough for most people. Plus it takes away a bunch of complexity for them.
> If the application has sustained query load
Analytical queries in the majority of cases don't cause sustained load.
It’s a few dashboards a handful of managers/teams check a couple of times through the day.
Or a few people crunching some ad hoc queries (and hopefully writing the intermediate results somewhere so they don't have to keep rerunning the same query — i.e. no sustained load problem).
> real-time data ingestion requirements
Most of the time a nightly batch job is good enough. Most businesses still work day by day or week by week, and that's at the high-frequency end of things.
> slow innovation speed
Most people don’t want bleeding edge innovative change. They want stability.
Data engineers have enough problems with teams changing source database fields without telling us. We don’t need the tool we’re storing the data with to constantly break too.
I think we're probably too early to build this today. Ray is used at my current job for scaling subroutines in our distributed job system. It's the closest I've seen.
Shameless plug, but there’s also Oban, a widely used database backed background job system.
The article mentions actor-model frameworks like Akka. Is that not like Ray?
At work we use and maintain something similar called Cloud Haskell (confusingly implemented in a package called distributed-process: https://github.com/haskell-distributed/distributed-process) and I have to say that using it is a breeze.
[0] and it’s an oversimplification because while concurrency can be reasoned about conceptually, parallelism implicates implementation realities around computation resources and data locality.
Heavily simplified version — Each partition is a separate file containing a bunch of table rows. And partition splits are determined by the values in those rows.
If you’ve got data with like a date column (sign up date or order date or something), you would partition on a YYYY-MM field you create early on.
Each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read a bunch of files from disk or S3. If you only want to look at 2023-12, then you only need to read one file to run the query.
Edit — OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.
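To make the pruning idea concrete, here's a toy stdlib-only sketch (the file layout, column names, and `month=YYYY-MM` naming are all made up for illustration — real systems like Hive/Iceberg do this with their own conventions):

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical order rows; "order_date" is the column we derive the
# YYYY-MM partition key from.
rows = [
    {"order_id": 1, "order_date": "2023-11-28", "amount": 40},
    {"order_id": 2, "order_date": "2023-12-03", "amount": 25},
    {"order_id": 3, "order_date": "2023-12-19", "amount": 60},
]

def write_partitions(rows, base_dir):
    """Group rows by YYYY-MM and write each group to its own file."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["order_date"][:7]].append(r)  # "2023-12-03" -> "2023-12"
    for month, group in groups.items():
        path = os.path.join(base_dir, f"month={month}.csv")
        with open(path, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=["order_id", "order_date", "amount"])
            w.writeheader()
            w.writerows(group)

def total_for_month(base_dir, month):
    """Partition pruning: only the single matching file is ever opened."""
    path = os.path.join(base_dir, f"month={month}.csv")
    with open(path) as f:
        return sum(int(r["amount"]) for r in csv.DictReader(f))

base = tempfile.mkdtemp()
write_partitions(rows, base)
print(total_for_month(base, "2023-12"))  # reads only month=2023-12.csv -> 85
```

Same shape as the real thing: the partition key is baked into the file path, so a filter on that key turns into "skip every file whose path doesn't match" before any data is read.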