For a basic overview: http://en.wikipedia.org/wiki/Paraccel
As for the rest of the article, it reads like Data Warehousing 101 rediscovered. It should have been titled "Analytics: Back To The Future" :-)
How much time and money would have been saved by learning Database Theory/SQL/Data Warehousing/Dimensional Modeling instead of cramming everything into an unstructured data store?
And even at moderate data sizes (10+ GB per table), row-store DBs tend to become painful. This is especially true when you need to support ad-hoc reporting queries, since the usual technique of tuning your schema, indexes, and queries to each other is no longer effective. With truly ad-hoc reporting, your best hope is lots of shallow single-column indices rather than composite ones tuned to a particular query.
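To make the "shallow indices" point concrete, here is a minimal sketch using SQLite as a stand-in row store; the table and column names are hypothetical. The idea is one single-column index per filterable column, so the planner can match whatever predicate an ad-hoc query happens to use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (user_id INT, country TEXT, ts INT, amount REAL)")

# Shallow, single-column indexes: each helps many different ad-hoc filters
# a little, instead of one composite index helping one known query a lot.
for col in ("user_id", "country", "ts"):
    cur.execute(f"CREATE INDEX idx_{col} ON events({col})")

cur.executemany(
    "INSERT INTO events VALUES (?,?,?,?)",
    [(i % 100, "US" if i % 2 else "DE", i, i * 0.5) for i in range(1000)],
)
conn.commit()

# The planner picks whichever shallow index matches this query's predicate.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE country = 'US' AND amount > 10"
).fetchall()
print(plan)
```

The trade-off, as the comment says: no single query gets the composite index that would make it fastest, but every query gets something.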
It's so exhausting to hear how much smarter you are, and that if we just educated ourselves we would realise the error of our ways. People who choose these technologies aren't stupid or masochistic. They understand their use case, and the fact is that there are plenty of situations where SQL is suboptimal.
This weekend I loaded 2 billion rows from S3 both ways:
- From a single gzipped object: 4 hours 42 minutes
- From 2000 gzipped slices of 1M rows each: 17 minutes
(Loading from gzipped files is considerably faster, in addition to saving S3 charges.)
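A minimal sketch of the prep step behind those numbers: splitting one large file into many gzipped slices so Redshift's COPY can load them in parallel across node slices. The function name and file naming scheme are my own; the S3 upload step (e.g. via boto3) and the COPY itself are omitted:

```python
import gzip
import itertools

def write_gzipped_slices(lines, rows_per_slice, prefix):
    """Write `lines` into gzip files prefix.0000.gz, prefix.0001.gz, ..."""
    it = iter(lines)
    names = []
    for n in itertools.count():
        chunk = list(itertools.islice(it, rows_per_slice))
        if not chunk:
            break
        name = f"{prefix}.{n:04d}.gz"
        with gzip.open(name, "wt") as f:
            f.writelines(chunk)
        names.append(name)
    return names
```

After uploading the slices under a common S3 key prefix, a single COPY pointed at that prefix (with the GZIP option) will spread the slices across the cluster, which is where the 4h42m-to-17m difference comes from.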
The article notes that choice of distribution key is critical. I'd add that choice of sort key is equally important. In my testing, a better sort key improved compression from 1.5:1 to 4:1, and also made common queries 5x faster.
Unfortunately, you only get one dist key and one sort key per table, so less common queries could get slower.
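A rough illustration (using gzip as a stand-in, not Redshift's columnar encodings) of why sort key choice moves the compression ratio so much: the same values compress far better once sorted, because long runs emerge. The numbers are synthetic and will differ from the 1.5:1 vs 4:1 figures above:

```python
import gzip
import random

random.seed(0)
# A low-cardinality column, as you might pick for a sort key.
values = [random.choice(["US", "DE", "FR", "JP"]) for _ in range(100_000)]

unsorted_bytes = gzip.compress("\n".join(values).encode())
sorted_bytes = gzip.compress("\n".join(sorted(values)).encode())

# Sorted data collapses into a few huge runs and compresses much smaller.
print(len(unsorted_bytes), len(sorted_bytes))
```

The same effect explains the query speedup: a sort key that clusters the rows a query touches lets the engine skip most blocks entirely.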
The OP's cluster is 16 hs1.xlarge nodes (3 spindles per node). There's a more powerful node type, hs1.8xlarge, with 24 spindles per node. More info: http://aws.amazon.com/redshift/pricing/
So it's not fair to compare Redshift performance to your Vertica cluster unless the hardware is similar.