A few examples of some useful conclusions:
- Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.
- Spark + S3 + Amazon Elastic Map Reduce look like an ideal tool to work with large data, but they're pretty slow compared to better tools, and even compared to plain PostgreSQL.
- HDFS really is a lot faster than S3.
- Performance of an Xeon Phi 64-core CPU is within an order of magnitude to an NVidia Titan X.
- Loading 104 GB of compressed data into Q/kdb+ expands to 125 GB with and takes about 30 minutes, but on Redshift expands to 2 TB and takes many hours to upload on a normal connection, plus 4 hours to actually import!
- It might cost $5000 to custom-build a GPU-based supercomputer that can do these queries in under a second, but you can run similar queries if you're willing to wait for 5 minutes each by spinning up instances for a few dollars an hour plus a few more dollars an hour for storage, or by just running PostgreSQL on your workstation.
Also, not a conclusion, but it's incredibly useful to have a simple example exactly how to configure the tool and import some CSV data