My favourite realization of this: Frank McSherry shows how simplicity and a few optimizations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, for both space and speed, to process the largest publicly available graph dataset - terabyte-sized, with over a hundred billion edges - all on his laptop's SSD.
[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...
There's no single simple answer, but sure: whenever fewer computers are enough, fewer should be used.
The recent problem is that some people love "clouds" so much today that they push work into them that could really be done locally.
PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
[1]: Your mileage may vary.
To the guy below me: Ah, thanks. I thought the guy above was trying to say it's slower than paging to disk. : )
Most businesses doing big data (like ours) often have multiple disparate data sources that are ETL'd into some EDW at the start of the pipeline. Trying to consolidate them into a single integrated view is very difficult and time/resource-intensive. Having billions of disconnected nodes in the graph would be very hard to reason about.
And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.
Is there some off-the-shelf solution to this problem? And, if so, why isn't it talked about more? Every CMS ever, for example, would be very well-served by something like this. My entire website's database, all ~100k comments and pages and issues and all 60k users, is only 1.4GB, and performance is always a problem. I don't care if I lose a couple minutes worth of comments in the event of a system reboot or crash. So, why can't I just turn that feature (in-memory with eventual on-disk consistency, or whatever you'd want to call it) on and forget about it?
Given a halfway competent I/O scheduler and some cheap SSDs, you can continuously write new data to disk at network wire speed even at 10 GbE while operating on the data in RAM and saturating outbound network. There is no slowdown at all. Even for databases that do not implement a good I/O scheduler (like PostgreSQL, unfortunately), your workload is sufficiently trivial that backing it with SSD should have no performance impact. If you are having a performance problem with a 1.4GB CMS, it is an architecture problem, not a database problem.
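For scale, a quick back-of-envelope using assumed, illustrative numbers (10 GbE line rate, ~500 MB/s sequential writes for a cheap SATA SSD):

```python
# Back-of-envelope: how many cheap SSDs does it take to absorb a 10 GbE
# firehose? All figures here are rough assumptions, not benchmarks.
wire_bytes_per_s = 10 * 1e9 / 8        # 10 GbE line rate = 1.25 GB/s
ssd_write_bytes_per_s = 500e6          # cheap SATA SSD, sequential writes
ssds_needed = wire_bytes_per_s / ssd_write_bytes_per_s
print(ssds_needed)                     # 2.5 -> three cheap SSDs keep up
```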
For the thing you described, you probably also want synchronous_commit=off. That means you might lose some commits in the case of a crash, but you won't get data corruption from it, and writes will be much faster.
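A minimal postgresql.conf sketch of that tradeoff (the setting names are real PostgreSQL parameters; the values are just illustrative):

```
# postgresql.conf -- trade a little durability for write latency
synchronous_commit = off     # commits return before the WAL hits disk
wal_writer_delay = 200ms     # WAL is still flushed in the background
```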
It simply is not your overall disk IO capacity that is the performance problem with a dataset that small. At least not for a CMS.
"It simply is not your overall disk IO capacity that is the performance problem with a dataset that small."
But it clearly and measurably is; there's nothing to argue about there. That which comes from RAM (reads) is fast; that which waits on disk (writes) is slow. Writes take several seconds to complete, so users wait several seconds for their comments and posts to save before being able to continue reading. That sucks, and is stupid and pointless, especially since it's not even all that important that we avoid data loss. A minute of data loss averages out to close enough to zero actual data loss, since crashes are so rare.
MySQL and PostgreSQL will totally take advantage of all of the RAM you give them.
> including writes?
This is harder, and you might not want it? It's worth noting that this argument is almost certainly directed against things like Hadoop, which claim to trade off performance for low management and easy scalability.
There's also a bunch of databases aimed at this use case (http://en.wikipedia.org/wiki/List_of_in-memory_databases), but I don't have any experience with them.
> I don't really care if I lose X amount of time worth of data (say, five minutes),
MySQL has generally got your back in the 'less safety for more performance' arena:
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.htm...
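Concretely, the relevant my.cnf knob looks something like this (the parameters are real InnoDB settings; the values are illustrative):

```
# my.cnf -- flush the InnoDB log once per second instead of per commit
[mysqld]
innodb_flush_log_at_trx_commit = 2   # write at commit, fsync ~once per second
innodb_flush_log_at_timeout    = 1   # seconds between fsyncs (MySQL 5.6+)
```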
So, this was never entirely clear to me, but now that I've read a bit more about it, this might actually be exactly what I want (which is to not have the system wait to return when posting new content, and just assume it'll end up on disk eventually). The talk of not being ACID made me nervous and maybe switched off my brain. I guess it just means I don't need or want ACID in this case, all I want is a consistent database on reboot.
So, I guess maybe this does what I want, but just to be clear: In the event of data loss, the database will still be consistent, correct? i.e. we'll lose one or more comments or written pieces of data, but the transaction it was wrapped up in won't be half finished or something in the database? (I recall MySQL had issues with this kind of thing in the very distant past, but I imagine that's just bad memories at this point.)
For example, it is very easy to write badly performing code using ORMs. And yet, an ORM is often chosen initially for good reasons, to give development speed (e.g. Django forms). The problem with quickly prototyped ORM-based apps is that the performance is good enough initially, but when the data grows, the number of queries goes through the roof. It is not the amount of data per se, but the number of queries. Fixing these performance problems afterwards is often too expensive for small customer projects, but if there were a plug'n'play in-memory SQL cache/replica with disk-based writes, it would easily handle the problem for many sites.
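A minimal sketch of that query blow-up, using sqlite3 in place of a real ORM backend (the schema and data are made up for illustration):

```python
# The "N+1 query" pattern a naive ORM loop tends to generate.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO post VALUES (1, 1, 'p1'), (2, 2, 'p2'), (3, 1, 'p3');
""")

# What a naive ORM loop does: one query for the posts, then one per post.
queries = 0
posts = db.execute("SELECT id, author_id, title FROM post").fetchall()
queries += 1
for _, author_id, _ in posts:
    db.execute("SELECT name FROM author WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)  # 1 + N queries: 4 here, thousands once the data grows

# The same result in one round trip (what eager loading emits instead).
rows = db.execute("""
    SELECT post.title, author.name
    FROM post JOIN author ON author.id = post.author_id
""").fetchall()
print(len(rows))  # 3
```

The data stays tiny in both cases; it's the per-row round trips that kill you as the row count grows.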
Configuring PostgreSQL to do something like reading data from in-memory replica is likely possible, but I see that there would be value in plug'n'play configuration script/solution.
Have you proven that your performance issues are SQL related? If you configure MySQL correctly and give it enough RAM, a lot of those queries are happily waiting for you in RAM, so you have a de facto RAM disk. Finding your bottleneck in a LAMP-based CMS is fairly non-trivial. Think of all the PHP and such that runs for every function. It's incredible how complex WP and Drupal are. Lots and lots of code runs for even the most trivial of things.
This is why we just move up one abstraction layer and dump everything in Varnish, which also puts its cache in RAM. Drupal and WP will never be fast, even if MySQL is. Might as well just use a transparent reverse proxy and call it a day.
It seems like, from another comment, that setting innodb_flush_log_at_trx_commit to a non-1 value is roughly what I want, though it still flushes every second, which is probably more often than I need, but may resolve the problem of the application waiting for the commit to disk, which is probably enough.
But I suspect that it's all already in the disk cache. Is the bottleneck reads or writes?
There is something pathological about our disk subsystem on that particular system, which is another issue, but it has often struck me as annoying that I can't just tell MySQL or MariaDB, "I don't care if you lose a few minutes of data. Just be consistent after a crash."
http://ehcache.org/documentation/2.6/configuration/cache-siz...
effective_cache_size should be set to a reasonable value of course, but it does not affect the allocated cache size, it's just used by the query optimizer.
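In other words, the two postgresql.conf settings play different roles (values illustrative):

```
# postgresql.conf
shared_buffers       = 8GB    # actually allocated by PostgreSQL
effective_cache_size = 48GB   # planner hint: RAM the OS page cache can use;
                              # nothing is allocated based on this
```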
The recent addition of SparkR in 1.4 means that data scientists can now leverage in-memory data in the cluster that has been put there by output from either Scala or DW developers.
Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.
I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory, but the Tachyon page seems to say the same thing.
If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.
So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.
When a problem becomes big enough, moving to a cluster is absolutely the right decision. Meanwhile, RAM is cheap and follows Moore's Law.
And many people don't want to deal with physical hardware. Dealing with physical hardware increases operational complexity too. They want to rent a virtual/cloud server. Which provider allows you to rent a virtual server with 1 TB RAM?
A 1TB RAM server is more expensive than 10x 100GB RAM servers, but the hardware cost is often small compared to the business and technical cost of getting a solution to scale across a cluster.
Of course, generalizations are always dangerous—the take-home point here is perhaps that before going to a cluster because “that's the way big data is handled,” it's a good idea to do a proper cost-benefit analysis.
It is true that the jump from 256GB to 3TB is "just" ~2x. I could get a server for 1/10 of the price of the original configuration, but only with 4GB of RAM, and nowhere near even 18 hardware threads.
If you are CPU limited (even at 72 hw threads) you might need more, smaller servers.
But such a monster should scale "pretty far", I'd say. It does cost about half as much as a small apartment, or one developer-year.
Sure, but in the latter case you'd also have to pay for the manpower to build a cluster solution out of a formerly simple application. And people are usually more expensive than servers.
If you work outside the order form, you can get 768 GB, too. 1 TB is possible with their Haswell servers, but availability seems limited.
Edit: And after those first three months he'll know more about the use and performance demands of the project and will be able to make far more accurate decisions about storage categories.
FB has a nice paper that talks about this problem. https://research.facebook.com/publications/300734513398948/x...
Maybe some bonus category:
0. Spreadsheet is all you need.
1. Python script is good enough.
2. Java/Scala is the way to go.
3. Need to manage memory (GC doesn't cut it), some custom organization.
4. Actually needs a cluster.
I HATE when people use Spreadsheets to do anything besides simple math.
http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...
TL;DR: your work is not reproducible, and we can't see what you did to get your numbers. There are a million examples of why this is bad.
Also
> 1. Python script is good enough
You mean Python with pandas and numpy?
I use R which is also a great choice
> 2. Java/Scala is the way to go.
For you, but the vast majority of data scientists don't use either, and your choice is not universal. Julia looks like a great newcomer. Again, I mainly use R.
> 3 & 4 are good points.
Ray Panko, University of Hawaii.
http://panko.shidler.hawaii.edu/SSR/
It goes back to 1993.
Though for many non-techie things, like daily sales transactions, it is the way to go.
Ad 1. pandas/numpy would put it on par with 2.
Ad 2. I would disagree. I know data scientists using Spark. Mostly they like the Scala API.
In general, everyone has their favorite weapon of choice and what they feel comfortable with. The point is that simpler solutions are sometimes enough to do the job.
Renting an r3.4xlarge on AWS for an hour and playing with your favorite tool may be orders of magnitude easier/cheaper/faster than using a big data solution.
Actually, I would bet that some 50% of the time people import numpy or pandas, they really don't need it.
Like for calculating the square root of a number. Or the average of a short list
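For those two examples, the standard library is already enough (a quick sketch):

```python
# No numpy/pandas needed for scalar math or a short list.
import math
import statistics

print(math.sqrt(2))                    # square root of a number
print(statistics.mean([3, 1, 4, 1]))   # average of a short list: 2.25
```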
I get who this is aimed at, and why, but just saying that it fits in RAM isn't as useful as it could be. This is an opportunity to teach, not just snark.
I work on something like that.
[1] http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...
The SGI one, with up to 2048 cores, has a larger single system image than most people have in their clusters.
The benefit of these systems is not really the ease of programming but the speed of interconnect.
List price of the Oracle one was 3 million a few years ago, but most of that is actually in the high-density DIMMs. These days I think the price must be lower, but I won't waste my Oracle sales contact's time on figuring out what it is today. Of course it will still be expensive; it is an Oracle product, after all.
However, an equivalent Dell list-price cluster of simple 1U boxes (512 6C/64GB ones!) will go for 1.5 million. Factor in having to house those 512 boxes, i.e. 25 racks or so, plus networking. Of course, you do get 1/3rd more cores than the SGI one.
Many of us sit between "just use a single normal server" and being big enough for the Google-scale solutions. These big-memory systems from Oracle and SGI can make sense, even if they are not the first thing that comes to mind!
Replacing a large number of nodes with a single machine with a lot of RAM is usually a cost savings measure rather than a larger expense (and it saves power too!), and due to a lack of communications overhead and exploitation of the fact that you now have access to all the data in one go you may very well find that your algorithms run much faster.
A distributed solution should be a means of last resort.
    var TB = Math.pow(2, 40);   // bytes; defined here so the snippet actually runs
    var MAX_SENSIBLE = 6 * TB;

    function doesMyDataFitInRam(dataSize) {
        return dataSize <= MAX_SENSIBLE;
    }
While one can purchase servers with larger memory, you will most likely run into limitations on the number of cores. Also note that there is at least some overhead in processing data, so you would need at least 2x the size of the raw data.
Finally, while it's fun to tweet, joke, and make fun of buzzwords while trying to appear smart, the reality is that purchasing such servers (>255 GB RAM) is a costly process. Further, you would ideally need two of them to remove the single point of failure. It is also likely that the job is a batch job: while it might take a terabyte of RAM, you only need to run it once a week. In all these cases you are much better off relying on a distributed system where each node has very large memory and the task can easily be split. Just because you have a cluster does not mean that each node has to be a small instance (4 processors, ~16 GB RAM).
That's assuming that everything needs to be 'high availability' and buying two of everything is a must. This is definitely not always the case. In plenty of situations buying a single item and simply repairing it when it breaks is a perfectly good strategy.
RAM is the new disk: now for some, later for others.
Right now the L1 cache has latencies of 1 or 2 cycles, and the L2 cache about 15; this is due to the overheads of cache coherency protocols and moving the data around the chip. It's not that the memory is slower; it's all SRAM.
They are probably referring to enterprise workloads. There you have large working sets (so caches are less useful) and you want maximum throughput. Clever fine-grained multithreading can reduce effective latency by scheduling many (32?) threads at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.
There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.
It's not as easy as just buying more RAM. You'll have to pay more attention to how you make use of the various caches in between your CPU and RAM.
If you're expecting growth in the size of your dataset (beyond growth in RAM size availability), then, well, maybe don't just use a single machine. Same goes for a whole bunch of similar "it's too large for a single machine" considerations.
Data should probably still be persisted to disk, and backed up.
Generally though, these posts are geared towards machine learning people that don't really have "live" data as frequently.
Well, well, well.
If you don't have money, you can't. Very few people can afford it.
You can now program Spark in R: http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...
Now you can work directly with SQL Server as announced this week by MS. http://www.computerworld.com/article/2923214/big-data/sql-se...
I have had a ton of arguments about R's "biggest weakness" being that it uses RAM. Not once in almost 3 years of working in R have I run into this roadblock, but I am sure others have. And there are several good distributed choices that will keep getting better and better.
Using RAM instead of a distributed system is better in R, as in really any other language, in terms of complexity and flexibility.