People keep rediscovering the same thing over and over: the ability to scale comes with significant overhead.
Most problems are not Big Data problems. The size a problem must reach before it qualifies as a Big Data problem grows every day with the availability of machines with ever more cores and memory. `sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.
People want to think they have Big Data problems, but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavyweight Big Data solutions to normal-data problems that "kids today" love.
If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.
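The single-machine approach the comment above alludes to can be sketched in a few lines: stream the file instead of loading it, the same trick that lets `grep`, `awk`, and `sort` chew through files far larger than RAM. The file layout and the `amount` field name here are hypothetical, just for illustration.

```python
import csv
import io

def total_sales(lines):
    """Sum the (hypothetical) 'amount' column from an iterable of CSV lines.

    Streaming: memory use is constant regardless of input size, so a
    single machine handles files far larger than RAM with no cluster.
    """
    total = 0.0
    for row in csv.DictReader(lines):
        total += float(row["amount"])
    return total

# In-memory demo; for a real file you would pass open("sales.csv") instead.
demo = io.StringIO("amount,region\n10.5,eu\n4.5,us\n")
```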
edit: (accidentally hit the submit button early).
I don't think people should leverage highly distributed software for small workloads, for the same reason they shouldn't write highly parallelized code for things that run perfectly fine on one thread. But the test, while well-intentioned, seems to miss the mark.
Is there a clear-cut answer as to whether one should choose a distributed solution or not? It seems to me that at the terabyte scale, choosing a non-distributed solution is asking for trouble. A quick search indicates the largest HDD you can buy is around 8 TB.
My point is, these "on a laptop/single-machine memory" examples don't really give me an indicator of scenarios where I might actually want to use Spark and the like.
Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.
You can only stuff so much into memory. You can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, at some point you have to scale out horizontally. And with each of these in-memory appliances, what happens when you need to spill out to disk?
In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big a hit compared to the enormous cost of keeping everything in memory?
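A back-of-envelope way to think about that question, with illustrative (not benchmarked) latency figures I'm assuming for the sake of the sketch: random access is roughly ~100 ns for DRAM and ~100 µs for a decent NVMe SSD, about three orders of magnitude apart.

```python
# Illustrative latencies, assumed for this sketch -- not measurements.
DRAM_LATENCY_NS = 100
NVME_LATENCY_NS = 100_000

def slowdown(fraction_spilled):
    """Average lookup cost when `fraction_spilled` of random lookups hit
    the SSD, relative to an all-in-memory workload."""
    avg = (1 - fraction_spilled) * DRAM_LATENCY_NS + fraction_spilled * NVME_LATENCY_NS
    return avg / DRAM_LATENCY_NS

# Spilling even 1% of random lookups to NVMe makes the average lookup
# roughly 11x slower than all-RAM. For sequential scans the gap is far
# smaller, so the answer depends heavily on the access pattern.
```

Which is to say: for random-access workloads the spill hit is real, while for scan-heavy batch jobs the SSD box is often the sane buy.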
A friend of mine works for a company that does high speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kind of systems extensively, because of the speed and volatility of the data. Fascinating stuff.
If the problem is that queries or sets of data might have to jump nodes, couldn't the data be laid out at write time based on assumptions about what sorts of queries will happen?
Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for whatever reason, someone created a 1 TB HDF5 file. Luckily, we had a machine with 500+ GB of RAM and plenty of swap, so instead of hacking the file apart and figuring out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.
That always makes me chuckle.
Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.
Edit: on a second check, it might have to do with that nav that moves the whole page down.
$ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
10000000 loops, best of 3: 0.0867 usec per loop
(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)

In this case, the software being tested was explicitly written to manage the coordination of data across many nodes, so why is the definition of "baseline" a single laptop? Seems specious.
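For reference, the Python 3 equivalent of that brute-force-vs-formula comparison, with `n` shrunk from the original `10**9` so the brute-force sum actually finishes:

```python
import timeit

n = 10**6  # far smaller than 10**9, so sum() completes quickly

def gauss(n):
    # Closed form for 1 + 2 + ... + n: O(1) no matter how large n is.
    return n * (n + 1) // 2

def brute(n):
    # What the cluster benchmarks actually compute: O(n) additions.
    return sum(range(1, n + 1))

t_formula = timeit.timeit(lambda: gauss(n), number=10)
t_sum = timeit.timeit(lambda: brute(n), number=10)
# The closed form wins by orders of magnitude -- one multiply and one
# shift versus a million additions -- which is why the timeit result
# quoted above is measured in fractions of a microsecond.
```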
How many people out there (genuine question here) assume the opposite of what you know, or are simply ignorant of it? How many people, when they hear "multithreaded," associate that with being faster?
Now consider the people who do know there is overhead in splitting and dividing work across threads: because they have this knowledge, do they also "see everything as a nail because they have a hammer"? Do they accept that sometimes the right solution is simply to run a single-threaded operation, not to parallelize everything?
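That overhead is easy to demonstrate. A sketch, assuming CPython: for a CPU-bound loop, splitting the work across threads buys nothing (the GIL serializes the bytecode) while still paying thread startup and coordination costs, so the "parallel" version is typically slower than the plain loop.

```python
import threading

def count_single(n):
    # One thread, one loop: no coordination cost at all.
    total = 0
    for i in range(n):
        total += i
    return total

def count_threaded(n, workers=4):
    # Split the range across threads. In CPython the GIL serializes the
    # bytecode anyway, so this pays startup + joining overhead for a
    # CPU-bound loop and gains nothing.
    results = [0] * workers
    def work(idx, lo, hi):
        s = 0
        for i in range(lo, hi):
            s += i
        results[idx] = s
    step = n // workers
    threads = [
        threading.Thread(
            target=work,
            args=(k, k * step, n if k == workers - 1 else (k + 1) * step),
        )
        for k in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Both compute the same answer; timing them shows the single-threaded version winning on small CPU-bound work, which is exactly the hammer-and-nail point.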
I think there are interesting merits to all of that, even if it means "hyperbolic" articles or clichéd, unrealistic tests. They challenge our thinking, our assumptions, our approach. And then, separately, there should be articles and discussions on real-world tests and use cases.
A web design QA note for all: thin fonts (e.g. 300–400 weight) used as a body font work fine on macOS due to better font rendering, but do not work well on Windows.