My favourite realization of this: Frank McSherry shows how simplicity and a few optimizations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, for both space and speed, to process the largest publicly available graph dataset - terabyte-sized, with over a hundred billion edges - all on his laptop's SSD.
[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...
There's no single simple answer, but sure: whenever fewer computers are enough, fewer should be used.
The recent problem is that some people love "clouds" so much today that they push work into them that could really be done locally.
PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
[1]: Your mileage may vary.
To the guy below me: Ah, thanks. I thought the guy above was trying to say it's slower than paging to disk. : )
Most businesses doing big data (like ours) often have multiple disparate data sources that are ETL'd into some EDW at the start of the pipeline. Trying to consolidate them into a single integrated view is very difficult and time/resource-intensive. Having billions of disconnected nodes in the graph would be very hard to reason about.
And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.
Is there some off-the-shelf solution to this problem? And, if so, why isn't it talked about more? Every CMS ever, for example, would be very well-served by something like this. My entire website's database, all ~100k comments and pages and issues and all 60k users, is only 1.4GB, and performance is always a problem. I don't care if I lose a couple minutes worth of comments in the event of a system reboot or crash. So, why can't I just turn that feature (in-memory with eventual on-disk consistency, or whatever you'd want to call it) on and forget about it?
Given a halfway competent I/O scheduler and some cheap SSDs, you can continuously write new data to disk at network wire speed even at 10 GbE while operating on the data in RAM and saturating outbound network. There is no slowdown at all. Even for databases that do not implement a good I/O scheduler (like PostgreSQL, unfortunately), your workload is sufficiently trivial that backing it with SSD should have no performance impact. If you are having a performance problem with a 1.4GB CMS, it is an architecture problem, not a database problem.
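For scale, a quick back-of-envelope using assumed, illustrative numbers (10 GbE line rate, ~500 MB/s sequential writes for a cheap SATA SSD):

```python
# Back-of-envelope: how many cheap SSDs does it take to absorb a 10 GbE
# firehose? All figures here are rough assumptions, not benchmarks.
wire_bytes_per_s = 10 * 1e9 / 8        # 10 GbE line rate = 1.25 GB/s
ssd_write_bytes_per_s = 500e6          # cheap SATA SSD, sequential writes
ssds_needed = wire_bytes_per_s / ssd_write_bytes_per_s
print(ssds_needed)                     # 2.5 -> three cheap SSDs keep up
```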
For the thing you described, you probably also want synchronous_commit=off. That means you might lose some commits in the case of a crash, but you won't get data corruption from it, and writes will be much faster.
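A minimal postgresql.conf sketch of that tradeoff (the setting names are real PostgreSQL parameters; the values are just illustrative):

```
# postgresql.conf -- trade a little durability for write latency
synchronous_commit = off     # commits return before the WAL hits disk
wal_writer_delay = 200ms     # WAL is still flushed in the background
```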
It simply is not your overall disk IO capacity that is the performance problem with a dataset that small. At least not for a CMS.
"It simply is not your overall disk IO capacity that is the performance problem with a dataset that small."
But it clearly and measurably is; there's nothing to argue about there. That which comes from RAM (reads) is fast; that which waits on disk (writes) is slow. Writes take several seconds to complete, so users wait several seconds for their comments and posts to save before being able to continue reading. That sucks, and is stupid and pointless, especially since it's not even all that important that we avoid data loss. A minute of data loss averages out to close enough to zero actual data loss, since crashes are so rare.
MySQL and PostgreSQL will totally take advantage of all of the RAM you give them.
> including writes?
This is harder, and you might not want it? It's worth noting that this argument is almost certainly directed against things like Hadoop, which claim to trade off performance for low management and easy scalability.
There's also a bunch of databases aimed at this use case (http://en.wikipedia.org/wiki/List_of_in-memory_databases), but I don't have any experience with them.
> I don't really care if I lose X amount of time worth of data (say, five minutes),
MySQL has generally got your back in the 'less safety for more performance' arena:
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.htm...
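Concretely, the relevant my.cnf knob looks something like this (the parameters are real InnoDB settings; the values are illustrative):

```
# my.cnf -- flush the InnoDB log once per second instead of per commit
[mysqld]
innodb_flush_log_at_trx_commit = 2   # write at commit, fsync ~once per second
innodb_flush_log_at_timeout    = 1   # seconds between fsyncs (MySQL 5.6+)
```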
So, this was never entirely clear to me, but now that I've read a bit more about it, this might actually be exactly what I want (which is to not have the system wait to return when posting new content, and just assume it'll end up on disk eventually). The talk of not being ACID made me nervous and maybe switched off my brain. I guess it just means I don't need or want ACID in this case, all I want is a consistent database on reboot.
So, I guess maybe this does what I want, but just to be clear: In the event of data loss, the database will still be consistent, correct? i.e. we'll lose one or more comments or written pieces of data, but the transaction it was wrapped up in won't be half finished or something in the database? (I recall MySQL had issues with this kind of thing in the very distant past, but I imagine that's just bad memories at this point.)
For example, it is very easy to write badly performing code using ORMs. And yet, an ORM is often chosen initially for good reasons, to give development speed (e.g. Django forms). The problem with quickly prototyped ORM-based apps is that the performance is good enough initially, but when the data grows, the number of queries goes through the roof. It is not the amount of data per se, but the number of queries. Fixing these performance problems afterwards is often too expensive for small customer projects, but if there were a plug'n'play in-memory SQL cache/replica with disk-based writes, it would easily handle the problem for many sites.
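A minimal sketch of that query blow-up, using sqlite3 in place of a real ORM backend (the schema and data are made up for illustration):

```python
# The "N+1 query" pattern a naive ORM loop tends to generate.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO post VALUES (1, 1, 'p1'), (2, 2, 'p2'), (3, 1, 'p3');
""")

# What a naive ORM loop does: one query for the posts, then one per post.
queries = 0
posts = db.execute("SELECT id, author_id, title FROM post").fetchall()
queries += 1
for _, author_id, _ in posts:
    db.execute("SELECT name FROM author WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)  # 1 + N queries: 4 here, thousands once the data grows

# The same result in one round trip (what eager loading emits instead).
rows = db.execute("""
    SELECT post.title, author.name
    FROM post JOIN author ON author.id = post.author_id
""").fetchall()
print(len(rows))  # 3
```

The data stays tiny in both cases; it's the per-row round trips that kill you as the row count grows.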
Configuring PostgreSQL to do something like reading data from in-memory replica is likely possible, but I see that there would be value in plug'n'play configuration script/solution.
Have you proven that your performance issues are SQL related? If you configure MySQL correctly and give it enough RAM, a lot of those queries are happily waiting for you in RAM, so you have a de facto RAM disk. Finding your bottleneck in a LAMP-based CMS is fairly non-trivial. Think of all the PHP and such that runs for every function. It's incredible how complex WP and Drupal are. Lots and lots of code runs for even the most trivial of things.
This is why we just move up one abstraction layer and dump everything in Varnish, which also puts its cache in RAM. Drupal and WP will never be fast, even if MySQL is. Might as well just use a transparent reverse proxy and call it a day.
It seems like, from another comment, that setting innodb_flush_log_at_trx_commit to a non-1 value is roughly what I want, though it still flushes every second, which is probably more often than I need, but may resolve the problem of the application waiting for the commit to disk, which is probably enough.
But I suspect that it's all already in the disk cache. Is the bottleneck reads or writes?
There is something pathological about our disk subsystem on that particular system, which is another issue, but it has often struck me as annoying that I can't just tell MySQL or MariaDB, "I don't care if you lose a few minutes of data. Just be consistent after a crash."
http://ehcache.org/documentation/2.6/configuration/cache-siz...
effective_cache_size should be set to a reasonable value of course, but it does not affect the allocated cache size, it's just used by the query optimizer.
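In other words, the two postgresql.conf settings play different roles (values illustrative):

```
# postgresql.conf
shared_buffers       = 8GB    # actually allocated by PostgreSQL
effective_cache_size = 48GB   # planner hint: RAM the OS page cache can use;
                              # nothing is allocated based on this
```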
The recent addition of SparkR in 1.4 means that data scientists can now leverage in-memory data in the cluster that has been put there by output from either Scala or DW developers.
Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.
I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory, but the Tachyon page seems to say the same thing.
If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.
So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.
When a problem becomes big enough, moving to a cluster is absolutely the right decision. Meanwhile, RAM is cheap and follows Moore's Law.
And many people don't want to deal with physical hardware. Dealing with physical hardware increases operational complexity too. They want to rent a virtual/cloud server. Which provider allows you to rent a virtual server with 1 TB RAM?
A 1TB RAM server is more expensive than 10x 100GB RAM servers, but the hardware cost is often small compared to the business and technical cost of getting a solution to scale across a cluster.
Of course, generalizations are always dangerous—the take-home point here is perhaps that before going to a cluster because “that's the way big data is handled,” it's a good idea to do a proper cost-benefit analysis.
It is true that the jump from 256GB to 3TB is "just" ~2x. I could get a server for 1/10 of the price of the original configuration, but only with 4GB of RAM, and nowhere near even 18 hardware threads.
If you are CPU limited (even at 72 hw threads) you might need more, smaller servers.
But such a monster should scale "pretty far", I'd say. It does cost about half as much as a small apartment, or one developer-year.
Sure, but in the latter case you'd also have to pay for the manpower to build a cluster solution out of a formerly simple application. And people are usually more expensive than servers.
If you work outside the order form, you can get 768 GB, too. 1 TB is possible with their Haswell servers, but availability seems limited.
Edit: And after those first three months he'll know more about the use and performance demands of the project and will be able to make far more accurate decisions about storage categories.
FB has a nice paper that talks about this problem. https://research.facebook.com/publications/300734513398948/x...
Maybe some bonus category:
0. Spreadsheet is all you need.
1. Python script is good enough.
2. Java/Scala is the way to go.
3. Need to manage memory (GC doesn't cut it), some custom organization.
4. Actually needs a cluster.
I HATE when people use Spreadsheets to do anything besides simple math.
http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...
TL;DR: your work is not reproducible, and we can't see what you did to get your numbers. There are a million examples of why this is bad.
Also
> 1. Python script is good enough
You mean Python with pandas and numpy?
I use R which is also a great choice
> 2. Java/Scala is the way to go.
For you, but the vast majority of data scientists don't use either, and your choice is not universal. Julia looks like a great newcomer. Again, I mainly use R.
> 3 & 4 are good points.
Ray Panko, University of Hawaii.
http://panko.shidler.hawaii.edu/SSR/
It goes back to 1993.
Though for many non-techie things, like daily sales transactions, it is the way to go.
Ad 1. pandas/numpy would put it on par with 2.
Ad 2. I would disagree. I know data scientists using Spark. Mostly they like the Scala API.
In general, everyone has their favorite weapon of choice and what they feel comfortable with. The point is that simpler solutions are sometimes enough to do the job.
Renting an r3.4xlarge on AWS for an hour and playing with your favorite tool may be orders of magnitude easier/cheaper/faster than using a big data solution.
Actually, I would bet that some 50% of the time people import numpy or pandas, they really don't need it.
Like for calculating the square root of a number. Or the average of a short list
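For those two examples, the standard library is already enough (a quick sketch):

```python
# No numpy/pandas needed for scalar math or a short list.
import math
import statistics

print(math.sqrt(2))                    # square root of a number
print(statistics.mean([3, 1, 4, 1]))   # average of a short list: 2.25
```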
I get who this is aimed at, and why, but just saying that it fits in RAM isn't as useful as it could be. This is an opportunity to teach, not just snark.
I work on something like that.
[1] http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...
The SGI one, with up to 2048 cores, has a larger single system image than most people have in their clusters.
The benefit of these systems is not really the ease of programming but the speed of interconnect.
List price of the Oracle one was 3 million a few years ago, but most of that is actually in the high-density DIMMs. These days I think the price must be lower, but I won't waste my Oracle sales contact's time on figuring out what it is today. Of course it will still be expensive; it is an Oracle product, after all.
However, an equivalent Dell list-price cluster of simple 1U boxes (512 6C/64GB ones!) will go for 1.5 million. Factor in having to house those 512 boxes, i.e. 25 racks or so, plus networking. Of course, you do get 1/3rd more cores than the SGI one.
Many of us sit between "just use a single normal server" and being big enough for the Google-scale solutions. These big-memory systems from Oracle and SGI can make sense, even if they are not the first thing that comes to mind!
Replacing a large number of nodes with a single machine with a lot of RAM is usually a cost savings measure rather than a larger expense (and it saves power too!), and due to a lack of communications overhead and exploitation of the fact that you now have access to all the data in one go you may very well find that your algorithms run much faster.
A distributed solution should be a means of last resort.
    var TB = Math.pow(2, 40);   // bytes; defined here so the snippet actually runs
    var MAX_SENSIBLE = 6 * TB;

    function doesMyDataFitInRam(dataSize) {
        return dataSize <= MAX_SENSIBLE;
    }
While one can purchase servers with larger memory, you will most likely run into limitations on the number of cores. Also note that there is at least some overhead in processing data, so you would need at least 2x the size of the raw data.
Finally, while it's fun to tweet, joke, and make fun of buzzwords while trying to appear smart, the reality is that purchasing such servers (>255 GB RAM) is a costly process. Further, you would ideally need two of them to remove the single point of failure. It is also likely that the job is a batch job: while it might take a terabyte of RAM, you only need to run it once a week. In all these cases you are much better off relying on a distributed system where each node has very large memory and the task can easily be split. Just because you have a cluster does not mean that each node has to be a small instance (4 processors, ~16 GB RAM).
That's assuming that everything needs to be 'high availability' and buying two of everything is a must. This is definitely not always the case. In plenty of situations buying a single item and simply repairing it when it breaks is a perfectly good strategy.
RAM is the new disk: now for some, later for others.
Right now the L1 cache has latencies of 1 or 2 cycles, and the L2 cache about 15; this is due to the overheads of cache coherency protocols and moving the data around the chip. It's not that the memory is slower; it's all SRAM.
They are probably referring to enterprise workloads. There you have large working sets (so caches are less useful) and you want maximum throughput. Clever fine-grained multithreading can reduce effective latency by scheduling many (32?) threads at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.
There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.
It's not as easy as just buying more RAM. You'll have to pay more attention to how you make use of the various caches in between your CPU and RAM.
If you're expecting growth in the size of your dataset (beyond growth in RAM size availability), then, well, maybe don't just use a single machine. Same goes for a whole bunch of similar "it's too large for a single machine" considerations.
Data should probably still be persisted to disk, and backed up.
Generally though, these posts are geared towards machine learning people that don't really have "live" data as frequently.
Well, well, well.
If you don't have money, you can't. Very few people can afford it.
You can now program Spark in R: http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...
Now you can work directly with SQL Server as announced this week by MS. http://www.computerworld.com/article/2923214/big-data/sql-se...
I have had a ton of arguments about R's "biggest weakness" being that it uses RAM. Not once in almost 3 years of working in R have I run into this roadblock, but I am sure others have. And there are several good distributed choices that will keep getting better and better.
Using RAM instead of a distributed system is better in R, as in really any other language, in terms of complexity and flexibility.