Almost always numpy matrix math + cython or C or Java on a single machine is enough. Not always-always; but if you can relax requirements slighly say by accepting a 45 minute lag from new data impacting the total model, or by caching the results of the top 10k most likely queries, or by putting more effort into stripping out the garbage parts of the data, or, sometimes, just throwing a $10k a month server or mathematician at the problem (sure is cheaper than a bunch of cheap servers + larger infrastructure team).
The times you need real scalability you know you need it. You'd laugh at how silly someone would be for trying to put it onto one machine. You're solving the travelling salesman problem for UPS (although I can think of some hacks here - I probably can't get it down to a single machine), or you're detecting logos in every Youtube video ever made, or you're working for the NSA.
Even if you know for sure you're going to need scalability. I don't think it hurts to just do it on a single box at first. Iterating quickly on the product is more important and once you have something proven you can get money from the market or from VCs to distribute it.
We could write 30 microservices deployed on 30 docker images with load balancing and FT and all that magic for a basic webapp...
Or we could just write a pretty fast webserver and do it with 1 server. (Or if it is stateless, do it with a few for still a lot less work than a giant microservice cluster).
I think in the last year or so microservices have become a little less cool, and people are more along the lines of "code cleanly so we can microservice if we need to down the road, but don't deploy it like that for 1.0"... seems similar for this.
And then they forget if they had just modified that one query and tweaked that one for-loop they could've had that same capacity without launching six new servers with all kinds of potential for the wiring to go down and cause downtime. Plus the dev time to build the services.
I still think containers are great, and I really like the abstractions kubernetes provides, but if I ever had enough traffic to worry about scaling, I envision running a small cluster of very powerful computers rather than 100s of weak ones.
That was the plan. (I won't go into the political bs at a company I no longer work with).
Do you have any examples of problems at which you have thrown a mathematician or could imagine doing so?
The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory. We know that because Amdahl's law tells us that parallelizing something invariably adds overhead. So from a systems perspective that overhead has to be "paid for" by some other improvement, and we'll see that it is access to data.
If your task is to process a 2TB data set, on a single machine using a 6GBS SATA channel and 2TB of FLASH SSDs you can read in that dataset into memory in 3333 seconds (at 600MB/sec which is very optimistic for our system), process it, and lets say you write out a 200GB reduced data set for another 333 seconds. so, conveniently, an hour of I/O time.
Now you take that same dataset and distribute it evenly across a thousand nodes. Each one then has 2GB of the data on it. Each node can read in their portion of the data set in 3 seconds, process it and write out their reduction in .3 seconds.
You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds.
That is when parallel data reduction architectures are better for a problem than a single threaded architecture. And that "betterness" is purely artificial in the sense that you would be better off with a single system that had 1,000 times the I/O bandwidth (cough mainframe cough) than 1,000 systems with the more limited bandwidth. However a 1,000 machines with one SSD it still cheaper buy than one mainframe of similar capability. So if, and its a big if, your algorithm can be expressed as a data map / reduce problem, and your data is large enough to push the cost of getting it into memory to have a look at significantly beyond cost of executing the program, then you can benefit positively by running it on a cluster rather than running it on a local machine.
Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.
So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines.
Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business.
The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective.
What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system.
Still not 1000 machines, but one cost in this scenario is that you also won't have to pay for the humans to run the network the 1000 machines live on, or the power for 999 machines, or the cooling, or the floorspace, etc.
"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.
There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.
Plus having many machines has the advantage of a higher granularity of control over workload and balancing.
It is about the limitation of overall performance improvement when improving the performance of a component. If, say, you improve the performance of a piece of your application that only accounts for 10% of the running time, then you're limited by a 10% total performance improvement. This matters in a parallel context because if you have heavily parallelized your application, and it has actually improved performance, you're likely to eventually be limited by the performance of the sequential parts of your application. See https://en.wikipedia.org/wiki/Amdahl%27s_law ; although the last line of the introduction reads like nonsense.
... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.
This is a non-trivial problem. Ok, it is trivial, but its serial in most cases. Unless you start out with a completely distributed data set. And allocate permanent space on those 1000 nodes. So the data has to move once. And you can amortize that across all the runs.
In reality, you cant. We have customers using PB of data for their analytics. Even across 1000 nodes, thats still TB scale.
Our approach is not radical, its simple. Build a better architecture system, with much higher bandwidth/lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per machine (our customers do). And if you need to scale up/out, use 100Gb nets, and other things.
On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. If you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So lets take your example.
1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. And at 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so its 20x 1000, or 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently. That type of allocation would be cost prohibitive.
Ok, use 10GbE. You'll actually get about 2-4 GbE speeds, but ok, So maybe 5-10k seconds to move your data. And your run is deeply in the noise, at 4 seconds.
Still not good.
For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds.
Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.
Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.
Again, I am biased, as this is what my company does.
I find your statements confusing.
The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.
My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"
Which company do you work for? (unless it's a secret for some reason)
Also, aggregate network bandwidth. A major use case for clustered processing is that it is MUCH faster to download external data in parallel across a cluster than in parallel on one box. If timeliness is a major use case of the system you're building, this basically requires a cluster, unless you want to end up re-implementing cluster functionality yourself.
No it isn't. There are plenty of CPU-bound tasks that run much faster if the work is distributed in parallel across multiple machines. We use Hadoop at my company primarily for that reason.
I know there are modelling work loads, like weather forecasting and analysing seismic data - but curious what kind of work you are doing?
So while your analysis is valid, there are more "costs" at play like developer time, cluster maintenance, hardware. I like to play with Spark's ML libraries but am wary about designing projects specifically around them because of this overhead, especially when trying to distribute some API/tech that you'd like others to use.
Not trying to be a downer, I actually wish the choice to go distributed was more of a no-brainer, hah. Would love for some APIs to emerge that could be used locally/distributed transparently without actually having to run a dummy cluster & data migration to run locally.
- Peter Principle: most decision makers are/feel technically insecure in the blog driven tech age, and cave in to direction from below. Of course, young developers want to play with shiny new things (given the general drudgery of the work involved).
- Emergence of DevOps: Engineers are being commoditized. There is an undeniable deskilling that goes hand in hand with having to wear all the technical hats. (A side glance here to pattern of deskilling of pilots in the age of fly by wire.) Sure, you will need to learn new 'tools' as 'operators', but what's the vote HN: what percent of these engineers could actually build one of these distributed systems? (To say nothing of being able to reign in the asynchronous distributed monster when it starts hitting its pain points.)
- You're not Google: I'm rather blunt when a team points to "Google does it". Google and the like have made a virtue out of necessity. Google/Facebook/Netflix/etc. had to resort to the pattern of lots of disposable commodity boxes. They also have the chops in house to field SREs that are simply not going to play machine room operator for enterprise IT.
The overwhelming majority of systems out there can run on a deployment scheme that 1:1 matches the logical diagram (x2 for fail over). And yes, it is amazing what one can do on a single laptop these days.
If you have code that is not able to run any more in a scripting language, and it is not embarrassingly parallel, you have two choices.
1. Move to something like C++, and optimize the heck out of it. You will gain something like 1-2 orders of magnitude in performance and then hit a wall.
2. Move to a distributed architecture. You immediately lose 1-2 orders of magnitude in performance, but then can scale essentially forever.
If you expect your distributed system to need less than 100 machines, you should seriously consider option #1.
1. Parallelize on a single machine. Most scripting languages have single-threaded bottlenecks (Python, Ruby, node.js, etc.), so this means using multiple processes. xargs -P goes a long way. The coding changes are essentially a subset of what you need to distribute your program anyway.
32x or 64x speedup is nothing to sneeze at on a modern machine. The difference between 5 minutes and 5 hours usually solves your problem, practically speaking. And this means you don't have to touch every line of code, as you would if you were doing a C++ rewrite.
But also don't forget that you can often rewrite 10% of your code in C++ and keep the other 90% in Python, and get a 10x speedup. This requires fairly deep understanding of both your program and of the Python/C interface. It helps to adopt a data flow style so you are not crossing the boundary a lot. And make sure you release the GIL, and consider starting threads in C++ rather than in Python, etc.
I've also optimized an R program in C++ and gotten 125x speedup -- and that's single threaded C++; multithreaded would be another 2 orders of magnitude!!! But it also involved fixing a bunch of R performance errors, so don't underestimate that too.
In practice, any program which you actually care enough about to rewrite for speed usually has some application-level performance bugs -- i.e. slowness unrelated to the slowness of the underlying language platform.
People scale out to many machines because they don't want to rewrite every line of code. But the first step is to "scale out" by using all the cores on a single machine. In some sense, this is the best of both worlds, because you are incurring neither the overhead of distribution (serialization and networking) nor the complexity of distributed error handling.
In my experience, you're best off optimizing relatively limited pieces of a system rather than big applications. And only consider the rewrite when you've looked at optimizing it in place. This means that the thing that needs to be optimized with a rewrite often has a chance of not having any giant stupid mistakes.
For example see http://bentilly.blogspot.com/2011/02/finding-related-items.h... for a case where I found a thousandfold speed increase by rewriting from SQL to C++. As much as was reasonably possible, I did not change the basic algorithm.
Sounds like you had a problem that was "embarrassingly parallel". It's not going to be as simple if you have graph problems which require lots of IPC in python/ruby/v8. That's where moving to a language which has threading support can help.
No way am I a scalability expert, and I really don't have time to be one. I started using Spark when I had to sort 10 TiB on disk, and it scored the highest on sorting performance. I struggled with implementing a fast disk sort quickly, and I gave Spark a whirl, and it fixed my problem, fast. Since then, I've found it useful in a lot of other ways.
Creating a distributed system is very difficult, even when using platforms like Spark. Not all algorithms can be scaled easily or scaled at all, and not all algorithms in Spark MLLib or GraphX are actually designed to be truly distributed or work equally well for all use cases/data.
We tried to implement one of our algorithms (written in Java) that was taking hours on a single machine (even when using all the cores) using methods from Spark MLLib just to find that Spark job was constantly crashing. Turned out that some of the functions just fetch all the data to the "driver" instance and calculate the result there.
My guess this is what happens with author's use case - yes, he ran it on Spark, but only one node ended up crunching all numbers. And/or network overhead of course.
After we found out that MLLib can't give us what we need, we reimplemented it from scratch in Scala, making sure we balance the load equally and don't have too much network (shuffle) traffic between the nodes.
As a result, we went from 2.5 hours on a single machine, to under 2 minutes on a cluster of 25 instances (same Ivy Bridge processor, just more cores per node). The algorithm scaled almost linearly, but it required carefully designing it with Spark specifics in mind.
On other side AWS now offers 2TB RAM machine. And single huge machine has smaller per GB cost than several smaller machines. I think clustered computing as we know will be soon gone. Only reason for multiple machines will be availability.
And for that matter, even when we can't stuff it in RAM, the boundaries of what we can do on a single server is also constantly pushed back thanks to SSDs. It's just a few years ago since I was unable to get read speeds of more than 6GB/sec out of a RAM disk. Today I have servers that easily do 2GB/sec out of NVMe SSDs.
It's not that we never need to go beyond a single server. But people often really have no concept of when they'll need to.
VISA does what, 150M transactions per day? I was doing more volume than that for telephone calls a couple years ago with a $5K server and a default install of MSSQL. (Full ACID, updating balances -- yes I know a CC tx is probably heavier than a VoIP call but still.) At 4KB per tx, VISA could use VoltDB and store all tx's in RAM for a week for like a million or two.
An 150M a day today is not that much more than it was 15 years ago it seems. (50-100% more?).
For many, many, businesses, data size just isn't an issue any more cost wise, and soon won't be a technical challenge at all either. Yet 10, 20 years ago, we're talking 8-digit+ implementations.
Sure there's some things that grow faster, like all this increased tracking. But in general?
This is true not just for computing algorithms, but for developer time/brain space as well. Single-threaded applications are far simpler to understand.
The takeaway shouldn't be "test it on a single laptop first", but rather "will the volume/velocity of data now/in the future absolutely preclude doing this on a single laptop". At my work, we process probably a hundred TB in a few-hour batch processing window at night, Terabytes of which remain in memory for fast access. There is no choice there but to pay the overhead.
The website is down, but the HN discussion is still there : https://news.ycombinator.com/item?id=9581862.
In fact the top comment there links to the original post here.
Maybe they ran out of ram?
I was browsing through dat for awhile, but haven't caught up with it lately:
https://github.com/maxogden/dat
Basically disk is so cheap that you should just keep 2 or 3 copies of your data around. And then you can sync them really quickly and do the processing on any one of N machines.