
LeifCarrotson 9y ago

All conclusions are only valid for similar workloads, but MapD and GPUs, Q/kdb+ and Xeon Phi, Redshift, Athena, BigQuery, Presto, and Elasticsearch all claim to be fast, inexpensive, easy to work with, and otherwise great for Big Data. Which ones really are fast? How fast is fast? How much is this going to cost? Do I need 5 nodes or 50?

A few examples of some useful conclusions:

- Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.

- Spark + S3 + Amazon Elastic Map Reduce look like an ideal tool to work with large data, but they're pretty slow compared to better tools, and even compared to plain PostgreSQL.

- HDFS really is a lot faster than S3.

- Performance of a Xeon Phi 64-core CPU is within an order of magnitude of an NVidia Titan X.

- Loading 104 GB of compressed data into Q/kdb+ expands it to 125 GB and takes about 30 minutes, but on Redshift it expands to 2 TB and takes many hours to upload on a normal connection, plus 4 hours to actually import!

- It might cost $5000 to custom-build a GPU-based supercomputer that can do these queries in under a second, but you can run similar queries if you're willing to wait for 5 minutes each by spinning up instances for a few dollars an hour plus a few more dollars an hour for storage, or by just running PostgreSQL on your workstation.
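For a rough sense of that last trade-off, here is a minimal Python sketch of the break-even point. Only the $5000 build cost and the 5 minutes per query come from the list above; the $6/hour rental rate (compute plus storage) is purely an assumed stand-in for "a few dollars an hour plus a few more", and the function names are hypothetical.

```python
def break_even_hours(upfront_cost: float, hourly_rate: float) -> float:
    """Hours of rented time that cost as much as buying hardware outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return upfront_cost / hourly_rate


def queries_in(hours: float, minutes_per_query: float) -> int:
    """How many back-to-back queries fit into the given number of hours."""
    return round(hours * 60 / minutes_per_query)


BUILD_COST_USD = 5000.0    # from the thread: custom GPU build
RENTAL_USD_PER_HOUR = 6.0  # ASSUMPTION: compute + storage, not a quoted price
MINUTES_PER_QUERY = 5.0    # from the thread: "wait for 5 minutes each"

hours = break_even_hours(BUILD_COST_USD, RENTAL_USD_PER_HOUR)
queries = queries_in(hours, MINUTES_PER_QUERY)
print(f"Break-even after about {hours:.0f} rented hours, or {queries} queries")
```

Under those assumed numbers, renting only becomes more expensive than the hardware after thousands of 5-minute queries. Any software license cost would be added to `BUILD_COST_USD`.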

Also, not a conclusion, but it's incredibly useful to have a simple example of exactly how to configure each tool and import some CSV data.


WhitneyLand 9y ago

These conclusions don't seem very useful because either they are already well established or are not valid. Some examples:

> Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.

Already well established for years with systems like Redis, more recently with GPU databases, and with other techniques posted on HN regularly.

> Spark + S3 + Amazon Elastic Map Reduce...is pretty slow compared to better tools, and even compared to plain PostgreSQL.

Not valid because it doesn't generalize. It depends so much on the type of work being done, the system architecture, etc., that you can only say it may or may not be true.

> HDFS really is a lot faster than S3.

This is already well established; Amazon states as much right in the docs: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-pl...

> Performance of a Xeon Phi 64-core CPU is within an order of magnitude of an NVidia Titan X.

Not precise enough to matter, because being within a 10x difference is not close to being competitive.

> Loading 104 GB of compressed data into Q/kdb+ expands it to 125 GB and takes about 30 minutes, but on Redshift it expands to 2 TB and takes many hours to upload on a normal connection, plus 4 hours to actually import!

I don't see how it's possible for 104 GB of CSV text data to decompress into only 125 GB. For CSV to compress only ~20%...doesn't make sense.

> It might cost $5000 to custom-build a GPU-based supercomputer that can do these queries in under a second

No, two problems here. The hardware in question could have used 1 cheap CPU instead of two expensive Xeons and been much less expensive. Bigger problem: The MapD software itself will be $50,000.

LeifCarrotson (OP) 9y ago

The speed comparisons may be well known to you, but as someone only really using trivial desktop app SQLite databases, they weren't known to me. Thanks for pointing out my errors!

> I don't see how it's possible for 104 GB of CSV text data to decompress into only 125 GB. For CSV to compress only ~20%...doesn't make sense.

The CSV file itself is around 500 GB. The internal representation, which might use binary formats for numbers, or compress text, uses 125 GB. Redshift expands it to 2TB for all the indexing and mapping.
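To make the sizes easier to compare, here is a small Python sketch that expresses each figure as a fraction of the raw CSV. The numbers are the ones quoted in this thread; treating 2 TB as 2000 GB is an assumption, and the labels are just illustrative names.

```python
# Sizes in GB as quoted in the thread. 2 TB is taken as 2000 GB (assumption).
RAW_CSV_GB = 500.0

sizes_gb = {
    "compressed download": 104.0,
    "Q/kdb+ internal": 125.0,
    "Redshift": 2000.0,
}

# Ratio of each representation to the uncompressed CSV.
ratios = {name: size / RAW_CSV_GB for name, size in sizes_gb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the raw CSV")
```

So the 125 GB figure is about a quarter of the raw CSV, not a 20% expansion of it, while the Redshift footprint is about four times the raw size.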

> Bigger problem: The MapD software itself will be $50,000.

Ouch. That's a rather large oversight. Is the author affiliated with MapD, perhaps?

juliangoldsmith 9y ago

>Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.

This is actually my biggest complaint with the article. He used cstore_fdw with Postgres, which doesn't allow much real indexing, and as far as I can tell (knowing only a little bit about it) he didn't really use any of the benefits of cstore_fdw.

I'd be interested to see how plain Postgres, possibly on a compressed filesystem, with properly-indexed tables stacks up.
