People keep rediscovering the same thing over and over: the ability to scale comes with significant overhead.
Most problems are not Big Data problems. The size a problem must reach before it qualifies as a Big Data problem grows every day with the availability of machines with ever more cores and memory. `sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.
People want to think they have Big Data problems, but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavyweight Big Data solutions to normal-data problems that "kids today" love.
If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.
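The single-machine approach the comment above alludes to can be sketched in a few lines: stream the file instead of loading it, the same trick that lets `grep`, `awk`, and `sort` chew through files far larger than RAM. The file layout and the `amount` field name here are hypothetical, just for illustration.

```python
import csv
import io

def total_sales(lines):
    """Sum the (hypothetical) 'amount' column from an iterable of CSV lines.

    Streaming: memory use is constant regardless of input size, so a
    single machine handles files far larger than RAM with no cluster.
    """
    total = 0.0
    for row in csv.DictReader(lines):
        total += float(row["amount"])
    return total

# In-memory demo; for a real file you would pass open("sales.csv") instead.
demo = io.StringIO("amount,region\n10.5,eu\n4.5,us\n")
```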
edit: (accidentally hit the submit button early).
I don't think people should leverage highly distributed software for small workloads, for the same reason they shouldn't write highly parallelized code for things that run perfectly fine on one thread. But the test, while well-intentioned, seems to miss the mark.
Is there a clear-cut answer as to whether one should choose a distributed solution or not? It seems to me that at the terabyte scale, choosing a non-distributed solution is asking for trouble. A quick search indicates the largest HDD you can buy is around 8 TB.
My point is, these "on a laptop/single-machine memory" examples don't really give me an indicator of scenarios where I might actually want to use Spark and the like.
Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.
You can only stuff so much into memory. You can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, at some point you have to scale out horizontally. And with each of these in-memory appliances, what happens when you need to spill out to disk?
In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big a hit compared to the enormous cost of keeping everything in memory?
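A back-of-envelope way to think about that question, with illustrative (not benchmarked) latency figures I'm assuming for the sake of the sketch: random access is roughly ~100 ns for DRAM and ~100 µs for a decent NVMe SSD, about three orders of magnitude apart.

```python
# Illustrative latencies, assumed for this sketch -- not measurements.
DRAM_LATENCY_NS = 100
NVME_LATENCY_NS = 100_000

def slowdown(fraction_spilled):
    """Average lookup cost when `fraction_spilled` of random lookups hit
    the SSD, relative to an all-in-memory workload."""
    avg = (1 - fraction_spilled) * DRAM_LATENCY_NS + fraction_spilled * NVME_LATENCY_NS
    return avg / DRAM_LATENCY_NS

# Spilling even 1% of random lookups to NVMe makes the average lookup
# roughly 11x slower than all-RAM. For sequential scans the gap is far
# smaller, so the answer depends heavily on the access pattern.
```

Which is to say: for random-access workloads the spill hit is real, while for scan-heavy batch jobs the SSD box is often the sane buy.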
A friend of mine works for a company that does high speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kind of systems extensively, because of the speed and volatility of the data. Fascinating stuff.
If the problem is that queries or sets of data might have to jump nodes, couldn't the data be laid out at write time based on assumptions about what sorts of queries will happen?
Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for whatever reason, someone created a 1 TB HDF5 file. Luckily, we had a machine with 500+ GB of RAM and plenty of swap, so instead of hacking the file apart and figuring out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.
That always makes me chuckle.
Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.
Edit: on a second check, it might have to do with that nav that moves the whole page down.
$ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
10000000 loops, best of 3: 0.0867 usec per loop
(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)

In this case, the software being tested was explicitly written to manage the coordination of data across many nodes, so why is the definition of "baseline" a single laptop? Seems specious.
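For reference, the Python 3 equivalent of that brute-force-vs-formula comparison, with `n` shrunk from the original `10**9` so the brute-force sum actually finishes:

```python
import timeit

n = 10**6  # far smaller than 10**9, so sum() completes quickly

def gauss(n):
    # Closed form for 1 + 2 + ... + n: O(1) no matter how large n is.
    return n * (n + 1) // 2

def brute(n):
    # What the cluster benchmarks actually compute: O(n) additions.
    return sum(range(1, n + 1))

t_formula = timeit.timeit(lambda: gauss(n), number=10)
t_sum = timeit.timeit(lambda: brute(n), number=10)
# The closed form wins by orders of magnitude -- one multiply and one
# shift versus a million additions -- which is why the timeit result
# quoted above is measured in fractions of a microsecond.
```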
How many people out there (genuine question here) assume the opposite of what you know, or are simply ignorant of it? How many people, when they hear "multithreaded," associate that with being faster?
Now consider the people who do know there is overhead in splitting and dividing work across threads: because they have this knowledge, do they also "see everything as a nail because they have a hammer"? Do they accept that sometimes the right solution is simply to run a single-threaded operation, not to parallelize everything?
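That overhead is easy to demonstrate. A sketch, assuming CPython: for a CPU-bound loop, splitting the work across threads buys nothing (the GIL serializes the bytecode) while still paying thread startup and coordination costs, so the "parallel" version is typically slower than the plain loop.

```python
import threading

def count_single(n):
    # One thread, one loop: no coordination cost at all.
    total = 0
    for i in range(n):
        total += i
    return total

def count_threaded(n, workers=4):
    # Split the range across threads. In CPython the GIL serializes the
    # bytecode anyway, so this pays startup + joining overhead for a
    # CPU-bound loop and gains nothing.
    results = [0] * workers
    def work(idx, lo, hi):
        s = 0
        for i in range(lo, hi):
            s += i
        results[idx] = s
    step = n // workers
    threads = [
        threading.Thread(
            target=work,
            args=(k, k * step, n if k == workers - 1 else (k + 1) * step),
        )
        for k in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Both compute the same answer; timing them shows the single-threaded version winning on small CPU-bound work, which is exactly the hammer-and-nail point.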
I think there are interesting merits to all of that, even if it means "hyperbolic" articles or clichéd, unrealistic tests. They challenge our thinking, our assumptions, our approach. And then, separately, there should be articles and discussions on real-world tests and use cases.
A web design QA note for all: thin fonts (e.g. 300–400 weight) used as a body font work fine on macOS due to better font rendering, but do not work well on Windows.