MapReduce the framework is proprietary to Google, and some pipelines built on it are still running inside Google.
MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the descriptions of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:
- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...)
- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do them on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MapReduce, but its use cases and APIs are now quite different.
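To make the first point concrete, here's a rough single-process sketch (toy helpers of my own, not any framework's API) of what filter and join look like when you have to express them as raw map/reduce steps — it's exactly this boilerplate that made people want higher-level primitives:

```python
# Toy illustration: filter and join expressed as raw map/reduce steps.
# All names here are made up for the example.

# A filter is just a mapper that emits conditionally (key unused here).
def filter_mapper(record, predicate):
    return [(None, record)] if predicate(record) else []

# A join is a mapper that tags each record with its source table...
def join_mapper(record, source, key_fn):
    return [(key_fn(record), (source, record))]

# ...plus a reducer that pairs up records sharing a key after the shuffle.
def join_reducer(tagged_values):
    left = [r for tag, r in tagged_values if tag == "left"]
    right = [r for tag, r in tagged_values if tag == "right"]
    return [(l, r) for l in left for r in right]

# One shuffled group for key "u1": one user row and two order rows.
group = [("left", {"id": "u1", "name": "Ada"}),
         ("right", {"user": "u1", "item": "pen"}),
         ("right", {"user": "u1", "item": "ink"})]
joined = join_reducer(group)  # two joined (user, order) pairs
```

In Spark or Beam this is a one-liner (`join`, `filter`); in raw MapReduce you write the tagging and pairing yourself every time.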
Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
It's a little biased towards Beam and away from Spark/Flink though, which makes it less practical and more conceptual. So as long as that's your cup of tea, go for it.
Why use map/reduce when you can have an entire DAG for fan-out/fan-in?
It's going to stay because it is useful:
Any operation that you can express with associative behavior is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. That is the unfair advantage when solving bigger problems.
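A tiny plain-Python illustration of the associativity point: because addition is associative, chunks can be reduced independently (on separate machines, in the real setting) and the partial results combined, and the answer is guaranteed to be the same:

```python
from functools import reduce

data = list(range(1, 101))

# Sequential reduce over everything.
total = reduce(lambda a, b: a + b, data)

# "Distributed" reduce: split into chunks, reduce each chunk
# independently, then combine the partial results.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
partials = [reduce(lambda a, b: a + b, chunk) for chunk in chunks]
combined = reduce(lambda a, b: a + b, partials)

# Associativity guarantees these agree: total == combined == 5050
```

Swap `+` for any other associative operation (max, set union, count merging) and the same split-anywhere property holds, which is exactly what the scheduler exploits.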
If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).
I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.
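For a sense of how generic the mechanism is, a whole single-process MapReduce skeleton is only a few lines (toy code, my own naming, not any framework's API), and higher-level operations can be layered on top of it:

```python
from collections import defaultdict

# Toy single-process MapReduce skeleton: map emits (key, value)
# pairs, shuffle groups values by key, reduce folds each group.
def map_reduce(records, mapper, reducer):
    # Map phase: emit (key, value) pairs from every record.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: fold each group into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count, layered on top of the skeleton.
docs = ["the quick fox", "the lazy dog", "the fox"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts["the"] == 3, counts["fox"] == 2
```

The distributed versions differ in where the three phases run, not in what they compute.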
I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.
But that doesn't really mean it's useful to think of GOTOs as the generalisation.
Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail-call without any arguments. See eg https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...
It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
Every time you run a SQL query on BigQuery, for example, you are executing those same fundamental map and shuffle primitives on the underlying data; it's just that the interface is very different.
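As a conceptual sketch (engine internals heavily simplified, names are mine), here's how a GROUP BY count decomposes into exactly those map, shuffle, and reduce steps:

```python
from itertools import groupby

# SQL:  SELECT country, COUNT(*) FROM users GROUP BY country
rows = [{"country": "FR"}, {"country": "DE"}, {"country": "FR"}]

# Map: project out the grouping key, emit (key, 1) per row.
pairs = [(r["country"], 1) for r in rows]

# Shuffle: bring all pairs for one key together
# (sort + group stands in for partitioning across workers).
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g]
           for k, g in groupby(pairs, key=lambda kv: kv[0])}

# Reduce: COUNT(*) is a sum over each group.
result = {k: sum(vs) for k, vs in grouped.items()}
# result == {"DE": 1, "FR": 2}
```

A real engine picks a physical plan, but the skeleton underneath a grouped aggregation is still map, shuffle, reduce.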
Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole: merging several steps into one where possible, or adding combiners (a partial reduce before the shuffle). This requires the whole processing pipeline to be defined in more specific operations; some systems propose SQL for formulating the task, but it can be done with other primitives too. Given such a pipeline, the optimizations are easy to implement, making the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside individual map/reduce/(combine) operations.
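A quick toy illustration of the combiner idea (plain Python, my own naming): counts are associative and commutative, so each mapper can pre-aggregate its own shard and ship far fewer pairs across the shuffle, with an identical final result:

```python
from collections import Counter

# Two mapper shards of words.
shards = [["the", "fox", "the"], ["the", "dog"]]

# Without a combiner: one ("word", 1) pair per occurrence is shuffled.
naive_pairs = [(w, 1) for shard in shards for w in shard]  # 5 pairs

# With a combiner: each shard ships its pre-aggregated partial counts.
combined_pairs = [p for shard in shards
                  for p in Counter(shard).items()]         # 4 pairs

# Reduce side: summing either pair list gives identical totals.
def reduce_counts(pairs):
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals

# reduce_counts(naive_pairs) == reduce_counts(combined_pairs)
```

On toy data the saving is tiny, but on skewed real data (a few very hot keys) the combiner can shrink shuffle traffic by orders of magnitude, which is why the frameworks insert it automatically when the reducer is associative.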
“I have data and I know SQL. What is it about your database that makes retrieving it better?”
Any other paradigm is going to be niche at best, and will likely fail outright.
SQL lacks type safety, testability, and composability.
> “I have data and I know SQL. What is it about your database that makes retrieving it better?”
Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.
Seattle Data Guy had a great end-of-year top-10 memes post recently, and one of them went like this:
> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?
> …
> you do have a collection of reliable and easy to query data sources, right?
---
Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.
That’s what I took from the parent at least.
YMMV obviously, depending on your domain. ML is a good example, where things like end-to-end speech-to-text operate on wav files directly.
Ignore that statement, and fight the uphill battle.
The batch daily log processor jobs will last longer than Fortran. Longer than Cobol. Longer than earth itself.
Nonsense... They'll end at the same time. Which is approximately concurrently with the universe.
there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.
even hadoop could be hard to debug when hitting a performance ceiling on challenging workloads. the streaming systems took this even further, spark being notorious for fiddling with knobs and praying the next job doesn't fail after a few hours, again.
i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern network and cpu are really fast when you stop adding random layers like lasagna.
i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.
It's definitely not a dead concept; I guess it's just not sexy to talk about, though.
I think it became popular because it looked sort of like an automatic dictionary-to-multi-thread converter. But it's pretty useless unless you know how to split up and process your data.
basically, if you can cut your data up into a queue, you can MapReduce. But, most pipelines are more complex than that, so you probably need a proper DAG with dependencies.
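For what it's worth, here's a minimal sketch of the DAG-with-dependencies idea using only Python's stdlib (stage names are made up): stages declare what they depend on, and a topological order guarantees each stage runs only after its inputs exist — which is exactly what a flat map/reduce queue can't express.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each pipeline stage lists the stages it depends on.
dag = {
    "clean":  [],                # raw data cleanup, no dependencies
    "join":   ["clean"],         # join needs cleaned inputs
    "agg":    ["join"],          # aggregate the joined data
    "report": ["agg", "clean"],  # report reads both agg and clean
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
# "clean" is always first, "report" always last.
```

`TopologicalSorter` also has a `prepare()`/`get_ready()` API for running independent stages concurrently, which is the fan-out the grandparent comment is talking about.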