MapReduce the framework is proprietary to Google, and some pipelines built on it are still running inside Google.
MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the descriptions of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:
- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...)
- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do them on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MapReduce, but its use cases and APIs are now quite different.
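To make the first point concrete, here's a rough single-process sketch (toy helpers of my own, not any framework's API) of what filter and join look like when you have to express them as raw map/reduce steps — it's exactly this boilerplate that made people want higher-level primitives:

```python
# Toy illustration: filter and join expressed as raw map/reduce steps.
# All names here are made up for the example.

# A filter is just a mapper that emits conditionally (key unused here).
def filter_mapper(record, predicate):
    return [(None, record)] if predicate(record) else []

# A join is a mapper that tags each record with its source table...
def join_mapper(record, source, key_fn):
    return [(key_fn(record), (source, record))]

# ...plus a reducer that pairs up records sharing a key after the shuffle.
def join_reducer(tagged_values):
    left = [r for tag, r in tagged_values if tag == "left"]
    right = [r for tag, r in tagged_values if tag == "right"]
    return [(l, r) for l in left for r in right]

# One shuffled group for key "u1": one user row and two order rows.
group = [("left", {"id": "u1", "name": "Ada"}),
         ("right", {"user": "u1", "item": "pen"}),
         ("right", {"user": "u1", "item": "ink"})]
joined = join_reducer(group)  # two joined (user, order) pairs
```

In Spark or Beam this is a one-liner (`join`, `filter`); in raw MapReduce you write the tagging and pairing yourself every time.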
Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
It's a little biased towards Beam and away from Spark/Flink though, which makes it less practical and more conceptual. So as long as that's your cup of tea, go for it.
Why use map/reduce when you can have an entire DAG for fan-out/fan-in?
It's going to stay because it is useful:
Any operation that you can express with associative behavior is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. That is the unfair advantage when solving bigger problems.
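A tiny plain-Python illustration of the associativity point: because addition is associative, chunks can be reduced independently (on separate machines, in the real setting) and the partial results combined, and the answer is guaranteed to be the same:

```python
from functools import reduce

data = list(range(1, 101))

# Sequential reduce over everything.
total = reduce(lambda a, b: a + b, data)

# "Distributed" reduce: split into chunks, reduce each chunk
# independently, then combine the partial results.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
partials = [reduce(lambda a, b: a + b, chunk) for chunk in chunks]
combined = reduce(lambda a, b: a + b, partials)

# Associativity guarantees these agree: total == combined == 5050
```

Swap `+` for any other associative operation (max, set union, count merging) and the same split-anywhere property holds, which is exactly what the scheduler exploits.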
If you were talking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).
I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.
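For a sense of how generic the mechanism is, a whole single-process MapReduce skeleton is only a few lines (toy code, my own naming, not any framework's API), and higher-level operations can be layered on top of it:

```python
from collections import defaultdict

# Toy single-process MapReduce skeleton: map emits (key, value)
# pairs, shuffle groups values by key, reduce folds each group.
def map_reduce(records, mapper, reducer):
    # Map phase: emit (key, value) pairs from every record.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: fold each group into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count, layered on top of the skeleton.
docs = ["the quick fox", "the lazy dog", "the fox"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts["the"] == 3, counts["fox"] == 2
```

The distributed versions differ in where the three phases run, not in what they compute.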
I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.
But that doesn't really mean it's useful to think of GOTOs as the generalisation.
Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail-call without any arguments. See eg https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...
It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.
Every time you run a SQL query on BigQuery, for example, you are executing those same fundamental map and shuffle primitives on the underlying data; it's just that the interface is very different.
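As a conceptual sketch (engine internals heavily simplified, names are mine), here's how a GROUP BY count decomposes into exactly those map, shuffle, and reduce steps:

```python
from itertools import groupby

# SQL:  SELECT country, COUNT(*) FROM users GROUP BY country
rows = [{"country": "FR"}, {"country": "DE"}, {"country": "FR"}]

# Map: project out the grouping key, emit (key, 1) per row.
pairs = [(r["country"], 1) for r in rows]

# Shuffle: bring all pairs for one key together
# (sort + group stands in for partitioning across workers).
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g]
           for k, g in groupby(pairs, key=lambda kv: kv[0])}

# Reduce: COUNT(*) is a sum over each group.
result = {k: sum(vs) for k, vs in grouped.items()}
# result == {"DE": 1, "FR": 2}
```

A real engine picks a physical plan, but the skeleton underneath a grouped aggregation is still map, shuffle, reduce.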
Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole: merging several steps into one where possible, or adding combiners (a partial reduce before the shuffle). This requires the whole processing pipeline to be defined in more specific operations; some systems propose SQL for formulating the task, but it can be done with other primitives too. Given such a pipeline, the optimizations are easy to implement, making the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside individual map/reduce/(combine) operations.
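A quick toy illustration of the combiner idea (plain Python, my own naming): counts are associative and commutative, so each mapper can pre-aggregate its own shard and ship far fewer pairs across the shuffle, with an identical final result:

```python
from collections import Counter

# Two mapper shards of words.
shards = [["the", "fox", "the"], ["the", "dog"]]

# Without a combiner: one ("word", 1) pair per occurrence is shuffled.
naive_pairs = [(w, 1) for shard in shards for w in shard]  # 5 pairs

# With a combiner: each shard ships its pre-aggregated partial counts.
combined_pairs = [p for shard in shards
                  for p in Counter(shard).items()]         # 4 pairs

# Reduce side: summing either pair list gives identical totals.
def reduce_counts(pairs):
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals

# reduce_counts(naive_pairs) == reduce_counts(combined_pairs)
```

On toy data the saving is tiny, but on skewed real data (a few very hot keys) the combiner can shrink shuffle traffic by orders of magnitude, which is why the frameworks insert it automatically when the reducer is associative.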
“I have data and I know SQL. What is it about your database that makes retrieving it better?”
Any other paradigm is going to be niche at best, and will likely fail outright.
SQL lacks type safety, testability, and composability.
> “I have data and I know SQL. What is it about your database that makes retrieving it better?”
Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.
Seattle Data Guy had a great end-of-year top-10 memes post recently, and one of them went like this:
> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?
> …
> you do have a collection of reliable and easy to query data sources, right?
---
Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.
That’s what I took from the parent at least.
YMMV obviously, depending on your domain. ML is a good example, where things like end-to-end speech-to-text operate on wav files directly.
Ignore that statement, and fight the uphill battle.
The batch daily log processor jobs will last longer than Fortran. Longer than Cobol. Longer than earth itself.
Nonsense... They'll end at the same time. Which is approximately concurrently with the universe.
there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.
even hadoop could be hard to debug when hitting a performance ceiling on challenging workloads. the streaming systems took this even further, spark being notorious for fiddling with knobs and praying the next job doesn't fail after a few hours, again.
i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern network and cpu are really fast when you stop adding random layers like lasagna.
i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.
It's definitely not a dead concept; I guess it's just not sexy to talk about, though.
I think it became popular because it looked sort of like an automatic dictionary-to-multi-thread converter. But it's pretty useless unless you know how to split up and process your data.
basically, if you can cut your data up into a queue, you can MapReduce. But, most pipelines are more complex than that, so you probably need a proper DAG with dependencies.
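For what it's worth, here's a minimal sketch of the DAG-with-dependencies idea using only Python's stdlib (stage names are made up): stages declare what they depend on, and a topological order guarantees each stage runs only after its inputs exist — which is exactly what a flat map/reduce queue can't express.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each pipeline stage lists the stages it depends on.
dag = {
    "clean":  [],                # raw data cleanup, no dependencies
    "join":   ["clean"],         # join needs cleaned inputs
    "agg":    ["join"],          # aggregate the joined data
    "report": ["agg", "clean"],  # report reads both agg and clean
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
# "clean" is always first, "report" always last.
```

`TopologicalSorter` also has a `prepare()`/`get_ready()` API for running independent stages concurrently, which is the fan-out the grandparent comment is talking about.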