Scalability, but at what cost? (2015) (opens in new tab)

(frankmcsherry.org)

165 pointsr-u-serious9y ago77 comments

77 comments

I've been doing data analysis and machine learning client work for quite some time now and for companies as small as a 3 person startup to advising a department of the Canadian government.

Almost always numpy matrix math + cython or C or Java on a single machine is enough. Not always-always; but if you can relax requirements slighly say by accepting a 45 minute lag from new data impacting the total model, or by caching the results of the top 10k most likely queries, or by putting more effort into stripping out the garbage parts of the data, or, sometimes, just throwing a $10k a month server or mathematician at the problem (sure is cheaper than a bunch of cheap servers + larger infrastructure team).

The times you need real scalability you know you need it. You'd laugh at how silly someone would be for trying to put it onto one machine. You're solving the travelling salesman problem for UPS (although I can think of some hacks here - I probably can't get it down to a single machine), or you're detecting logos in every Youtube video ever made, or you're working for the NSA.

Even if you know for sure you're going to need scalability. I don't think it hurts to just do it on a single box at first. Iterating quickly on the product is more important and once you have something proven you can get money from the market or from VCs to distribute it.

brianwawok9y ago

This is kind of the same argument as microservices.

We could write 30 microservices deployed on 30 docker images with load balancing and FT and all that magic for a basic webapp...

Or we could just write a pretty fast webserver and do it with 1 server. (Or if it is stateless, do it with a few for still a lot less work than a giant microservice cluster).

I think in the last year or so microservices have become a little less cool, and people are more along the lines of "code cleanly so we can microservice if we need to down the road, but don't deploy it like that for 1.0"... seems similar for this.

ben_jones9y ago

People forget that the two methods of scaling are HORIZONTAL and VERTICAL. They think: "I can just put some micro-services behind HA proxy and boom, more capacity!".

And then they forget if they had just modified that one query and tweaked that one for-loop they could've had that same capacity without launching six new servers with all kinds of potential for the wiring to go down and cause downtime. Plus the dev time to build the services.

2 more replies

sigusr19y ago

If you haven't seen stackoverflows server architecture posts you'd probably enjoy them: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

I still think containers are great, and I really like the abstractions kubernetes provides, but if I ever had enough traffic to worry about scaling, I envision running a small cluster of very powerful computers rather than 100s of weak ones.

2 more replies

geofft9y ago

See also https://circleci.com/blog/its-the-future/, which is almost too true to be funny.

tracker19y ago

My solution was more about higher availability, and staggered deployments... but each of about 8 services (including web) would be deployed to 3 servers, identical config with dokku... then there would be a load balancing nginx instance that pointed to the app.foo on each of the three servers.. deployment would update/cycle nginx, deploy to first server, roll over, then deploy to the other two.

That was the plan. (I won't go into the political bs at a company I no longer work with).

Bromskloss9y ago

> throwing a $10k a month server or mathematician at the problem

Do you have any examples of problems at which you have thrown a mathematician or could imagine doing so?

ChuckMcM9y ago

I like the analysis, basically it says "hey you don't have big data" :-) but that requires a bit more explanation.

The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory. We know that because Amdahl's law tells us that parallelizing something invariably adds overhead. So from a systems perspective that overhead has to be "paid for" by some other improvement, and we'll see that it is access to data.

If your task is to process a 2TB data set, on a single machine using a 6GBS SATA channel and 2TB of FLASH SSDs you can read in that dataset into memory in 3333 seconds (at 600MB/sec which is very optimistic for our system), process it, and lets say you write out a 200GB reduced data set for another 333 seconds. so, conveniently, an hour of I/O time.

Now you take that same dataset and distribute it evenly across a thousand nodes. Each one then has 2GB of the data on it. Each node can read in their portion of the data set in 3 seconds, process it and write out their reduction in .3 seconds.

You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds.

That is when parallel data reduction architectures are better for a problem than a single threaded architecture. And that "betterness" is purely artificial in the sense that you would be better off with a single system that had 1,000 times the I/O bandwidth (cough mainframe cough) than 1,000 systems with the more limited bandwidth. However a 1,000 machines with one SSD it still cheaper buy than one mainframe of similar capability. So if, and its a big if, your algorithm can be expressed as a data map / reduce problem, and your data is large enough to push the cost of getting it into memory to have a look at significantly beyond cost of executing the program, then you can benefit positively by running it on a cluster rather than running it on a local machine.

Joeri9y ago

For a 2 TB dataset you could also pay supermicro 50k to get a 40 core 3 TB RAM monster that can keep that whole dataset in RAM. At 50 GB / sec throughput that would keep your query roundtrip time at somewhere around the minute mark. Not quite 3 seconds, but then not quite a thousand nodes either. Of course, rebooting that machine would be awkward.

Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.

ChuckMcM9y ago

I think this is also a vary salient observation. It is always possible to construct really large systems, right up to mainframes. And if you're asked to do so it helps to take as wide a view on the problem as you can.

So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines.

Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business.

The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective.

What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system.

JoachimSchipper9y ago

Idle question: is 50 GB/s a reasonable throughput to expect for such a monster? DDR4 has a peak transfer rate of ~12.8-19.2 GB/s per stick (per https://en.wikipedia.org/wiki/DDR4_SDRAM), so I'd expect quite a bit more bandwidth for predictable accesses - are you using some useful rule-of-thumb, or just unduly pessimistic? ;-)

1 more reply

bane9y ago

I'd point out that modern SSDs are getting within the an order of magnitude of RAM in terms of speed. If your dataset is larger than what can fit in RAM, you can probably come up with some kind of on-disk storage mechanism, at the cost of speed. Buy a few and RAID them and you can probably get even closer.

Still not 1000 machines, but one cost in this scenario is that you also won't have to pay for the humans to run the network the 1000 machines live on, or the power for 999 machines, or the cooling, or the floorspace, etc.

"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.

nostrebored9y ago

How about when that operation is at the core of your business logic and happens multiple time a day? How about when it's a blocking operation?

There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.

Plus having many machines has the advantage of a higher granularity of control over workload and balancing.

1 more reply

scott_s9y ago

Amdahl's law is not about overhead - it is true that parallelizing a computation usually has some cost, and you have to make sure what you're parallelizing is a heavy-enough computation that you still win in the presence of that cost. But that's not what Amdahl's law is about.

It is about the limitation of overall performance improvement when improving the performance of a component. If, say, you improve the performance of a piece of your application that only accounts for 10% of the running time, then you're limited by a 10% total performance improvement. This matters in a parallel context because if you have heavily parallelized your application, and it has actually improved performance, you're likely to eventually be limited by the performance of the sequential parts of your application. See https://en.wikipedia.org/wiki/Amdahl%27s_law ; although the last line of the introduction reads like nonsense.

sijoe9y ago

[old HPC guy here, you kids get off my lawn!]

... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.

This is a non-trivial problem. Ok, it is trivial, but its serial in most cases. Unless you start out with a completely distributed data set. And allocate permanent space on those 1000 nodes. So the data has to move once. And you can amortize that across all the runs.

In reality, you cant. We have customers using PB of data for their analytics. Even across 1000 nodes, thats still TB scale.

Our approach is not radical, its simple. Build a better architecture system, with much higher bandwidth/lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per machine (our customers do). And if you need to scale up/out, use 100Gb nets, and other things.

On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. If you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So lets take your example.

1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. And at 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so its 20x 1000, or 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently. That type of allocation would be cost prohibitive.

Ok, use 10GbE. You'll actually get about 2-4 GbE speeds, but ok, So maybe 5-10k seconds to move your data. And your run is deeply in the noise, at 4 seconds.

Still not good.

For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds.

Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.

Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.

Again, I am biased, as this is what my company does.

justinsaccount9y ago

> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.

I find your statements confusing.

The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.

My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"

1 more reply

e12e9y ago

Sounds like you are doing the opposite of what joyent are doing: they basically pair (relevant parts of) data with the program (to process that part). And the reduce/aggregate over the result (that's my takeaway from joyent's marketing, anyway).

Which company do you work for? (unless it's a secret for some reason)

1 more reply

cle9y ago

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

Also, aggregate network bandwidth. A major use case for clustered processing is that it is MUCH faster to download external data in parallel across a cluster than in parallel on one box. If timeliness is a major use case of the system you're building, this basically requires a cluster, unless you want to end up re-implementing cluster functionality yourself.

vonmoltke9y ago

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

No it isn't. There are plenty of CPU-bound tasks that run much faster if the work is distributed in parallel across multiple machines. We use Hadoop at my company primarily for that reason.

e12e9y ago

Can you say something more about your workload? Ever since dipping my feet in the MPI pool, I've wondered what kind of problems really lend themselves to running in parallell across multiple machines these days - assuming a single machine has at least 32 hardware threads.

I know there are modelling work loads, like weather forecasting and analysing seismic data - but curious what kind of work you are doing?

fat0wl9y ago

This is true but keep in mind that you have to set up a pipeline for collecting the data into HDFS (Storm? batch loading?) & you have to pay for the machines.

So while your analysis is valid, there are more "costs" at play like developer time, cluster maintenance, hardware. I like to play with Spark's ML libraries but am wary about designing projects specifically around them because of this overhead, especially when trying to distribute some API/tech that you'd like others to use.

Not trying to be a downer, I actually wish the choice to go distributed was more of a no-brainer, hah. Would love for some APIs to emerge that could be used locally/distributed transparently without actually having to run a dummy cluster & data migration to run locally.

eternalban9y ago

I've [had] this conversation with clients, CTO level, mostly in context of microservices. A few observations:

- Peter Principle: most decision makers are/feel technically insecure in the blog driven tech age, and cave in to direction from below. Of course, young developers want to play with shiny new things (given the general drudgery of the work involved).

- Emergence of DevOps: Engineers are being commoditized. There is an undeniable deskilling that goes hand in hand with having to wear all the technical hats. (A side glance here to pattern of deskilling of pilots in the age of fly by wire.) Sure, you will need to learn new 'tools' as 'operators', but what's the vote HN: what percent of these engineers could actually build one of these distributed systems? (To say nothing of being able to reign in the asynchronous distributed monster when it starts hitting its pain points.)

- You're not Google: I'm rather blunt when a team points to "Google does it". Google and the like have made a virtue out of necessity. Google/Facebook/Netflix/etc. had to resort to the pattern of lots of disposable commodity boxes. They also have the chops in house to field SREs that are simply not going to play machine room operator for enterprise IT.

The overwhelming majority of systems out there can run on a deployment scheme that 1:1 matches the logical diagram (x2 for fail over). And yes, it is amazing what one can do on a single laptop these days.

btilly9y ago

I have long offered the following advice.

If you have code that is not able to run any more in a scripting language, and it is not embarrassingly parallel, you have two choices.

1. Move to something like C++, and optimize the heck out of it. You will gain something like 1-2 orders of magnitude in performance and then hit a wall.

2. Move to a distributed architecture. You immediately lose 1-2 orders of magnitude in performance, but then can scale essentially forever.

If you expect your distributed system to need less than 100 machines, you should seriously consider option #1.

chubot9y ago

I would call those options #2 and #3. Option #1 is:

1. Parallelize on a single machine. Most scripting languages have single-threaded bottlenecks (Python, Ruby, node.js, etc.), so this means using multiple processes. xargs -P goes a long way. The coding changes are essentially a subset of what you need to distribute your program anyway.

32x or 64x speedup is nothing to sneeze at on a modern machine. The difference between 5 minutes and 5 hours usually solves your problem, practically speaking. And this means you don't have to touch every line of code, as you would if you were doing a C++ rewrite.

But also don't forget that you can often rewrite 10% of your code in C++ and keep the other 90% in Python, and get a 10x speedup. This requires fairly deep understanding of both your program and of the Python/C interface. It helps to adopt a data flow style so you are not crossing the boundary a lot. And make sure you release the GIL, and consider starting threads in C++ rather than in Python, etc.

I've also optimized an R program in C++ and gotten 125x speedup -- and that's single threaded C++; multithreaded would be another 2 orders of magnitude!!! But it also involved fixing a bunch of R performance errors, so don't underestimate that too.

In practice, any program which you actually care enough about to rewrite for speed usually has some application-level performance bugs -- i.e. slowness unrelated to the slowness of the underlying language platform.

People scale out to many machines because they don't want to rewrite every line of code. But the first step is to "scale out" by using all the cores on a single machine. In some sense, this is the best of both worlds, because you are incurring neither the overhead of distribution (serialization and networking) nor the complexity of distributed error handling.

btilly9y ago

That is a worthwhile option, but parallelization hasn't generally offered me nearly that much of a win when I'm limited by memory access performance.

In my experience, you're best off optimizing relatively limited pieces of a system rather than big applications. And only consider the rewrite when you've looked at optimizing it in place. This means that the thing that needs to be optimized with a rewrite often has a chance of not having any giant stupid mistakes.

For example see http://bentilly.blogspot.com/2011/02/finding-related-items.h... for a case where I found a thousandfold speed increase by rewriting from SQL to C++. As much as was reasonably possible, I did not change the basic algorithm.

chillacy9y ago

> and it is not embarrassingly parallel

Sounds like you had a problem that was "embarrassingly parallel". It's not going to be as simple if you have graph problems which require lots of IPC in python/ruby/v8. That's where moving to a language which has threading support can help.

ap222139y ago

I think the missed point is that Spark is very easy. I can get an average Java or Python developer trained up on it in less than a day. The python shell is very simple to use out of the box. And, it's incredibly convenient to be able to either run locally or on a huge cluster. I can use the same code to easily process batch jobs from 1 MiB to 100 TiB. In my mind, it's just a cost savings. Developer time is expensive, and it's hard to find great developers. Hardware is cheap.

No way am I a scalability expert, and I really don't have time to be one. I started using Spark when I had to sort 10 TiB on disk, and it scored the highest on sorting performance. I struggled with implementing a fast disk sort quickly, and I gave Spark a whirl, and it fixed my problem, fast. Since then, I've found it useful in a lot of other ways.

Eugr9y ago

I agree with author that [in most cases] you don't need distributed processing for your algorithms. But sometimes you do, and when you do need it you have to understand that there is no silver bullet.

Creating a distributed system is very difficult, even when using platforms like Spark. Not all algorithms can be scaled easily or scaled at all, and not all algorithms in Spark MLLib or GraphX are actually designed to be truly distributed or work equally well for all use cases/data.

We tried to implement one of our algorithms (written in Java) that was taking hours on a single machine (even when using all the cores) using methods from Spark MLLib just to find that Spark job was constantly crashing. Turned out that some of the functions just fetch all the data to the "driver" instance and calculate the result there.

My guess this is what happens with author's use case - yes, he ran it on Spark, but only one node ended up crunching all numbers. And/or network overhead of course.

After we found out that MLLib can't give us what we need, we reimplemented it from scratch in Scala, making sure we balance the load equally and don't have too much network (shuffle) traffic between the nodes.

As a result, we went from 2.5 hours on a single machine, to under 2 minutes on a cluster of 25 instances (same Ivy Bridge processor, just more cores per node). The algorithm scaled almost linearly, but it required carefully designing it with Spark specifics in mind.

Eridrus9y ago

NLLib is ostensibly open source, why not improve it? Was your final solution too specialized?

Eugr9y ago

Yes, we ended up implementing just project specific parts, not generic enough for MLLib contribution...

grayrest9y ago

If you're interested in reading more, Frank moved his blog to a github repo:

https://github.com/frankmcsherry/blog

dikaiosune9y ago

The author recently gave a talk at a Rust meetup about similar things:

https://air.mozilla.org/bay-area-rust-meetup-may-2016/

wueiued9y ago

I think graph operations are not fair comparison. It is notoriously difficult to scale.

On other side AWS now offers 2TB RAM machine. And single huge machine has smaller per GB cost than several smaller machines. I think clustered computing as we know will be soon gone. Only reason for multiple machines will be availability.

brianwawok9y ago

Do you think our datasets will stop growing? It seems to me that data is growing faster than RAM, and has been for years. How do we find the upper limit of data? The Human Genome is finite, it will only get so big. What you did on facebook? Seems near infinite....

vidarh9y ago

I'm sure the upper end of our datasets won't stop growing for the foreseeable future. But a huge proportion of problems has growth rates well below the growth rate of RAM.

And for that matter, even when we can't stuff it in RAM, the boundaries of what we can do on a single server is also constantly pushed back thanks to SSDs. It's just a few years ago since I was unable to get read speeds of more than 6GB/sec out of a RAM disk. Today I have servers that easily do 2GB/sec out of NVMe SSDs.

It's not that we never need to go beyond a single server. But people often really have no concept of when they'll need to.

MichaelGG9y ago

For businesses? Sure there's gonna be a lot of limits. Only 7 billion humans so that sorta limits any user tables. Only so many things those people can buy each day, so that limits your orders table.

VISA does what, 150M transactions per day? I was doing more volume than that for telephone calls a couple years ago with a $5K server and a default install of MSSQL. (Full ACID, updating balances -- yes I know a CC tx is probably heavier than a VoIP call but still.) At 4KB per tx, VISA could use VoltDB and store all tx's in RAM for a week for like a million or two.

An 150M a day today is not that much more than it was 15 years ago it seems. (50-100% more?).

For many, many, businesses, data size just isn't an issue any more cost wise, and soon won't be a technical challenge at all either. Yet 10, 20 years ago, we're talking 8-digit+ implementations.

Sure there's some things that grow faster, like all this increased tracking. But in general?

collyw9y ago

I can't remember where it was I saw it recently, but it was pointed out that there are diminishing returns on using more data. Your accuracy with something like Google's pagerank is good with 1 billion data points for input, but doesn't improve much with more data.

dang9y ago

Discussed at the time: https://news.ycombinator.com/item?id=8901983.

colin_mccabe9y ago

Big data systems aren't about squeezing every drop of performance out of the hardware. They're about being able to scale your solution up effortlessly. Having a uniform programming model so that you don't have to rewrite your solution N times when you go from a small test dataset to production data, to 1000x the data. It's also about using a system that other people use, so that you can actually hire admins and service companies with experience in maintaining your solution.

kibwen9y ago

Note that this is from January 2015 and thus predates the stable Rust 1.0 release in May 2015, so it's possible that the code examples do not compile on post-1.0 Rust.

theanomaly9y ago

Thanks for the analysis -- it is good that people have this context in their heads when designing systems. The missing conversation from this article is that some people conflate scalability with performance. They are different, and you absolutely trade one for the other. At large scale you end up getting performance simply from being able to throw more hardware at it, but it takes you quite a while to catch up to where you would have been on a single machine.

This is true not just for computing algorithms, but for developer time/brain space as well. Single-threaded applications are far simpler to understand.

The takeaway shouldn't be "test it on a single laptop first", but rather "will the volume/velocity of data now/in the future absolutely preclude doing this on a single laptop". At my work, we process probably a hundred TB in a few-hour batch processing window at night, Terabytes of which remain in memory for fast access. There is no choice there but to pay the overhead.

1 more reply

dzdt9y ago

This reminds me of the "your data fits in ram" website which was on HN last year. Basically that site asked you your data size, then answered "yes" for any size up to a few TB.

The website is down, but the HN discussion is still there : https://news.ycombinator.com/item?id=9581862.

In fact the top comment there links to the original post here.

herge9y ago

> The website is down,

Maybe they ran out of ram?

psiclops9y ago

I was at mozilla for your talk!! Very interesting stuff

eva19849y ago

I bet the author didn't count the account of time of downloading data to a single box. Scalability, sometimes, is not a choice.

chubot9y ago

I really want a content-addressed storage system with differential compression to solve this problem.

I was browsing through dat for awhile, but haven't caught up with it lately:

https://github.com/maxogden/dat

Basically disk is so cheap that you should just keep 2 or 3 copies of your data around. And then you can sync them really quickly and do the processing on any one of N machines.

j / k navigate · click thread line to collapse

77 comments

3pt141599y ago

I've been doing data analysis and machine learning client work for quite some time now and for companies as small as a 3 person startup to advising a department of the Canadian government.

brianwawok9y ago

This is kind of the same argument as microservices.

We could write 30 microservices deployed on 30 docker images with load balancing and FT and all that magic for a basic webapp...

Or we could just write a pretty fast webserver and do it with 1 server. (Or if it is stateless, do it with a few for still a lot less work than a giant microservice cluster).

ben_jones9y ago

People forget that the two methods of scaling are HORIZONTAL and VERTICAL. They think: "I can just put some micro-services behind HA proxy and boom, more capacity!".

2 more replies

sigusr19y ago

If you haven't seen stackoverflows server architecture posts you'd probably enjoy them: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

2 more replies

geofft9y ago

See also https://circleci.com/blog/its-the-future/, which is almost too true to be funny.

tracker19y ago

That was the plan. (I won't go into the political bs at a company I no longer work with).

Bromskloss9y ago

> throwing a $10k a month server or mathematician at the problem

Do you have any examples of problems at which you have thrown a mathematician or could imagine doing so?

ChuckMcM9y ago

I like the analysis, basically it says "hey you don't have big data" :-) but that requires a bit more explanation.

You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds.

Joeri9y ago

Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.

ChuckMcM9y ago

JoachimSchipper9y ago

1 more reply

bane9y ago

"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.

nostrebored9y ago

How about when that operation is at the core of your business logic and happens multiple time a day? How about when it's a blocking operation?

There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.

Plus having many machines has the advantage of a higher granularity of control over workload and balancing.

1 more reply

scott_s9y ago

sijoe9y ago

[old HPC guy here, you kids get off my lawn!]

... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.

In reality, you cant. We have customers using PB of data for their analytics. Even across 1000 nodes, thats still TB scale.

Ok, use 10GbE. You'll actually get about 2-4 GbE speeds, but ok, So maybe 5-10k seconds to move your data. And your run is deeply in the noise, at 4 seconds.

Still not good.

Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.

Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.

Again, I am biased, as this is what my company does.

justinsaccount9y ago

> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.

I find your statements confusing.

My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"

1 more reply

e12e9y ago

Which company do you work for? (unless it's a secret for some reason)

1 more reply

cle9y ago

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

vonmoltke9y ago

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

No it isn't. There are plenty of CPU-bound tasks that run much faster if the work is distributed in parallel across multiple machines. We use Hadoop at my company primarily for that reason.

e12e9y ago

I know there are modelling work loads, like weather forecasting and analysing seismic data - but curious what kind of work you are doing?

fat0wl9y ago

This is true but keep in mind that you have to set up a pipeline for collecting the data into HDFS (Storm? batch loading?) & you have to pay for the machines.

eternalban9y ago

I've [had] this conversation with clients, CTO level, mostly in context of microservices. A few observations:

btilly9y ago

I have long offered the following advice.

If you have code that is not able to run any more in a scripting language, and it is not embarrassingly parallel, you have two choices.

1. Move to something like C++, and optimize the heck out of it. You will gain something like 1-2 orders of magnitude in performance and then hit a wall.

2. Move to a distributed architecture. You immediately lose 1-2 orders of magnitude in performance, but then can scale essentially forever.

If you expect your distributed system to need less than 100 machines, you should seriously consider option #1.

chubot9y ago

I would call those options #2 and #3. Option #1 is:

btilly9y ago

That is a worthwhile option, but parallelization hasn't generally offered me nearly that much of a win when I'm limited by memory access performance.

chillacy9y ago

> and it is not embarrassingly parallel

ap222139y ago

Eugr9y ago

My guess this is what happens with author's use case - yes, he ran it on Spark, but only one node ended up crunching all numbers. And/or network overhead of course.

Eridrus9y ago

NLLib is ostensibly open source, why not improve it? Was your final solution too specialized?

Eugr9y ago

Yes, we ended up implementing just project specific parts, not generic enough for MLLib contribution...

grayrest9y ago

If you're interested in reading more, Frank moved his blog to a github repo:

https://github.com/frankmcsherry/blog

dikaiosune9y ago

The author recently gave a talk at a Rust meetup about similar things:

https://air.mozilla.org/bay-area-rust-meetup-may-2016/

wueiued9y ago

I think graph operations are not fair comparison. It is notoriously difficult to scale.

brianwawok9y ago

vidarh9y ago

I'm sure the upper end of our datasets won't stop growing for the foreseeable future. But a huge proportion of problems has growth rates well below the growth rate of RAM.

It's not that we never need to go beyond a single server. But people often really have no concept of when they'll need to.

MichaelGG9y ago

For businesses? Sure there's gonna be a lot of limits. Only 7 billion humans so that sorta limits any user tables. Only so many things those people can buy each day, so that limits your orders table.

An 150M a day today is not that much more than it was 15 years ago it seems. (50-100% more?).

For many, many, businesses, data size just isn't an issue any more cost wise, and soon won't be a technical challenge at all either. Yet 10, 20 years ago, we're talking 8-digit+ implementations.

Sure there's some things that grow faster, like all this increased tracking. But in general?

collyw9y ago

dang9y ago

Discussed at the time: https://news.ycombinator.com/item?id=8901983.

colin_mccabe9y ago

kibwen9y ago

Note that this is from January 2015 and thus predates the stable Rust 1.0 release in May 2015, so it's possible that the code examples do not compile on post-1.0 Rust.

theanomaly9y ago

This is true not just for computing algorithms, but for developer time/brain space as well. Single-threaded applications are far simpler to understand.

1 more reply

dzdt9y ago

This reminds me of the "your data fits in ram" website which was on HN last year. Basically that site asked you your data size, then answered "yes" for any size up to a few TB.

The website is down, but the HN discussion is still there : https://news.ycombinator.com/item?id=9581862.

In fact the top comment there links to the original post here.

herge9y ago

> The website is down,

Maybe they ran out of ram?

psiclops9y ago

I was at mozilla for your talk!! Very interesting stuff

eva19849y ago

I bet the author didn't count the account of time of downloading data to a single box. Scalability, sometimes, is not a choice.

chubot9y ago

I really want a content-addressed storage system with differential compression to solve this problem.

I was browsing through dat for awhile, but haven't caught up with it lately:

https://github.com/maxogden/dat

Basically disk is so cheap that you should just keep 2 or 3 copies of your data around. And then you can sync them really quickly and do the processing on any one of N machines.

j / k navigate · click thread line to collapse