Amazon RedShift vs. local PostgreSQL (opens in new tab)

(rarestblog.com)

73 pointsrarestblog13y ago32 comments

32 comments

What this test is essentially doing is comparing Postgres against a single node of Redshift. It is not surprising that Postgres is faster. But Redshift is not meant to be used on a single node.

What Postgres & Redshift represent are are two different products for two very different problems. Postgres is good for small sets of transactional data like orders in a shopping cart system (less than 1TB). Redshift is good for big sets of data involving user behavior and clickstream analysis (greater than 1TB). I would not want to manage clickstream data on a single instance of Postgres nor would I want to manage an order system in Redshift.

A better test of Redshift would be to see how it compares to Asterdata...particularly with both in AWS. That should be telling.

eksith13y ago

We don't run a shopping cart, but one of our databases at present is at 11.3TB on PostgreSQL 9.1 and we're by no means dealing with small sets. We routinely juggle several Gigs at a time when we need to do analytics. We didn't see a reason to put this on a cloud since bandwidth + electricity is still cheaper for us than bandwidth + storage in the cloud at present.

monstrado13y ago

If you have a few servers to spare, I'd recommend installing Cloudera Impala on them. You can use Apache Sqoop to pull the data out of Postgres and into HDFS.. Directly after, you can run SQL queries which will query the data in parallel (similar to redshift).

monstrado13y ago

I don't think comparing RedShift to Postgres is accurate, RedShift was not designed for transactions, it was designed to store/query billions of rows using a columnar storage format...it's more like an analytic database (Greenplum, Teradata). Also, these databases are designed to scale out, and so you usually don't really see compelling performance gains until you start adding a few nodes to help influence parallelization.

With that being said, I'd be interested to see how RedShift compares to Impala.

lcampbell13y ago

I really don't understand what's going on here.

* You're measuring request latency. What part of that (for RedShift) is due to the network? (EDIT: I re-read and saw you're using `SELECT 1` as a gauge for round-trip latency and subtracting it from the results. Are you only doing this for RedShift, or also for local PostgreSQL? To me, it seems like that heuristic is over broad -- it encapsulates not only network latency, but syscall overhead, query parsing, etc).

* In your tests, PostgreSQL without indices performs on-par with RedShift. Does RedShift not support indexing? Is there some metric you're trying to show by not using indices? As designed, this benchmark does not map to any use-case I've ever seen.

rarestblogOP13y ago

* I noted in post that "SELECT 1" was done in 100ms round-trip, so I subtracted that from all points where it was reasonable. (No point to subtract it from 8 seconds).

* Not sure about the index support. Didn't try.

My idea was quite simple. I have some data at work (databases up to 30GB). Sometimes we hope to find something better. The main question was - will RedShift help, will it be radically faster? Will it be radically easier?

The answer for me - no, it won't help in my case, we need that 30GB data in real time, it looks like RedShift is more when you have 1TB+ data. Yes, it is radically easier.

lcampbell13y ago

Thanks for the reply. I figured for a dataset of that size, the main bottleneck might be not indexing -- maybe RedShift stores row data in a higher-latency medium while keeping indexes in-memory. Curious, I checked the documentation[1] and found this:

> Amazon Redshift doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.

Reading through the rest of their FAQ, it sounds like they echo your conclusion -- RedShift shines the most for use-cases where the dataset is large enough that, to use PostgreSQL, you'd have to shard out multiple instances.

[1] http://aws.amazon.com/redshift/faqs/#0030

nieksand13y ago

I don't see your python code defining a distribution or sort key for Redshift which is an important design consideration. (For my own use case of log analysis, I sort on datetime and use an "even" distribution). Also doesn't look like you ran "vacuum" or "analyze" after doing the loads to Redshift. So the query optimizer has no statistics to drive its decisions.

And as others have pointed out, your 30 GB data set is pretty tiny. You could look at some of the in-memory DB options out there if you need to speed things up.

rubyrescue13y ago

very interesting. one of the reasons we picked mysql for a very high-volume app over postgres is that we have RDS and didn't want to do backups/snapshots/etc. Could we now use RedShift as a postgres-API RDS?

agotterer13y ago

You wouldn't want to replace RDS (or MySQL/postgres) with Redshift for OLTP workloads. Redshift is built for analytics and complex query workloads. The major benefit is the ability to do MPP (Massive Parallel Processing) and distribute the work out to nodes. I believe you can have up to 100 nodes in the cluster with the 8XL machines and 32 with the single XL.

pcl13y ago

RedShift is tuned for analytics, so if you're using MySQL today for analytics, it could be a good fit. But if you're looking for more of an OLTP engine, RedShift is probably not the right choice for you.

amalag13y ago

You need to run this with a column store database like Infobright. Postgres is more of a transactional database, Infobright is suited towards the similar large dataset analytics that this is aimed towards.

scotth13y ago

Redshift isn't Postgres. It's built on top of Postgres, similar to how Infobright is built on top of MySQL.

zwily13y ago

Redshift is built on ParAccel, not Postgres. It speaks the Postgres API, but the storage engine is not Postgres.

1 more reply

eduardordm13y ago

I run 3 large oracle RDS instances I wonder if redshift could be effectively used the same way, we have been thinking about migrating to postgresql.

fusiongyro13y ago

How big is your data?

eduardordm13y ago

16tb per instance

1 more reply

Whitespace13y ago

Wouldn't it have been better to do an EXPLAIN ANALYZE for the timing measurements instead of having the results returned locally?

fusiongyro13y ago

It would certainly be interesting to see what Redshift returns when you do that.

rarestblogOP13y ago

Honestly I don't know. Do you think it would be better? Could you explain why?

kapilvt13y ago

because it shows the estimated cost (plan) of the queries independent of the magically subtracted network roundtrip time. it also serves to show if the redshift even supports it.it.

1 more reply

ozgune13y ago

This is a pretty interesting. I wonder how query performance differs between Redshift and local PostgreSQL for other types of benchmarks as well, say TPC-H queries. (And I guess how Redshift scales out as the dataset size increases in TPC-H.)

csummers13y ago

I'd like to see some more information about the local setup, including hardware and the postgresql.conf. Otherwise, this tells me very little in terms of comparison.

rarestblogOP13y ago

In anyway this test won't tell you much, just how different systems behave to bigger load.

The local setup was quite usual: PostgreSQL 9.2, Mint 13, default conf in VirtualBox in iMac i5 12GB. (read: home computer, no tuning)

For me the result is that mostly RedShift is on par with local PostgreSQL, sometimes even winning for <5M rows. So with better PostgreSQL tuning you can probably stretch it, but not for as much as RedShift can do for REALLY big data.

Also the big deal was that RedShift scaled linearly.

fusiongyro13y ago

It scaled linearly, but also went unresponsive for five minutes. (Yeah yeah, it's a new service).

The default Postgres configuration is pretty weak. work_mem is set way to low, for instance, and that's bitten me a few times. I wouldn't say it's unrealistic--lots of people run with it that way in production and never find out how easily they could speed things up. Even me, for years.

But ultimately I'm more swayed by your interaction with it and I hate the endless benchmark tweaking that comes after every blog post about performance testing stuff. The point of this Redshift thing is hugeness first and foremost, so it's interesting.

dragontamer13y ago

At least on the small scale, I'd expect realistic "local" hosting to outperform any offering from Amazon.

For example, until your data scales above 20 GB, you'd be able to host it on a $5 SSD-based server from Digital Ocean. Most databases are bottlenecked by I/O. So switching to SSDs gives you the biggest performance boost.

On the high end, Amazon's offer probably will be better. (After all, the major draw to Amazon is "automatic scaling", so that you don't have to worry about Replication or other server administration duties at the high end). But considering how powerful a $5 SSD-virtual machine is today, I think a more realistic test would be with some sort of SSD-based cloud server.

ramarnat13y ago

Like others have mentioned, Postgres and Redshift are very different animals, Postgres is a row store and Redshift is a column store. On large data sets analytic queries that return a few columns will significantly outperform a row store DB.

We have found that Redshift is comparable to other columnar databases we work with, while we cannot publish any comparative benchmarks, we did put a blog post on what we found (link in another comment here)

crazydoggers13y ago

Data warehousing often involves star schemas, which means lots of joins in your queries. I'd love to see how a real world OLAP tool performs on this.

ramarnat13y ago

While you could use Redshift as a source for OLAP, most OLAP tools will have their own data store. But if you are referring to ROLAP, then it can perform well if tuned properly for the star schema. This would include BI tools like Microstrategy and Mondrian with Jaspersoft.

The main issue with Redshift is the lack of multiple sort orders on a table. Take a look at our blog post on first impressions gleaned during the preview. Disclosure: we are one of a couple of systems integrator partners for Redshift.

http://www.full360.com/2013/02/14/aws-redshift-full360-first...

j / k navigate · click thread line to collapse

32 comments

iblaine13y ago

What this test is essentially doing is comparing Postgres against a single node of Redshift. It is not surprising that Postgres is faster. But Redshift is not meant to be used on a single node.

A better test of Redshift would be to see how it compares to Asterdata...particularly with both in AWS. That should be telling.

eksith13y ago

monstrado13y ago

With that being said, I'd be interested to see how RedShift compares to Impala.

lcampbell13y ago

I really don't understand what's going on here.

rarestblogOP13y ago

* I noted in post that "SELECT 1" was done in 100ms round-trip, so I subtracted that from all points where it was reasonable. (No point to subtract it from 8 seconds).

* Not sure about the index support. Didn't try.

The answer for me - no, it won't help in my case, we need that 30GB data in real time, it looks like RedShift is more when you have 1TB+ data. Yes, it is radically easier.

lcampbell13y ago

> Amazon Redshift doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.

[1] http://aws.amazon.com/redshift/faqs/#0030

nieksand13y ago

And as others have pointed out, your 30 GB data set is pretty tiny. You could look at some of the in-memory DB options out there if you need to speed things up.

rubyrescue13y ago

agotterer13y ago

pcl13y ago

amalag13y ago

scotth13y ago

Redshift isn't Postgres. It's built on top of Postgres, similar to how Infobright is built on top of MySQL.

zwily13y ago

Redshift is built on ParAccel, not Postgres. It speaks the Postgres API, but the storage engine is not Postgres.

1 more reply

eduardordm13y ago

I run 3 large oracle RDS instances I wonder if redshift could be effectively used the same way, we have been thinking about migrating to postgresql.

fusiongyro13y ago

How big is your data?

eduardordm13y ago

16tb per instance

1 more reply

Whitespace13y ago

Wouldn't it have been better to do an EXPLAIN ANALYZE for the timing measurements instead of having the results returned locally?

fusiongyro13y ago

It would certainly be interesting to see what Redshift returns when you do that.

rarestblogOP13y ago

Honestly I don't know. Do you think it would be better? Could you explain why?

kapilvt13y ago

because it shows the estimated cost (plan) of the queries independent of the magically subtracted network roundtrip time. it also serves to show if the redshift even supports it.it.

1 more reply

ozgune13y ago

csummers13y ago

I'd like to see some more information about the local setup, including hardware and the postgresql.conf. Otherwise, this tells me very little in terms of comparison.

rarestblogOP13y ago

In anyway this test won't tell you much, just how different systems behave to bigger load.

The local setup was quite usual: PostgreSQL 9.2, Mint 13, default conf in VirtualBox in iMac i5 12GB. (read: home computer, no tuning)

Also the big deal was that RedShift scaled linearly.

fusiongyro13y ago

It scaled linearly, but also went unresponsive for five minutes. (Yeah yeah, it's a new service).

dragontamer13y ago

At least on the small scale, I'd expect realistic "local" hosting to outperform any offering from Amazon.

ramarnat13y ago

crazydoggers13y ago

Data warehousing often involves star schemas, which means lots of joins in your queries. I'd love to see how a real world OLAP tool performs on this.

ramarnat13y ago

http://www.full360.com/2013/02/14/aws-redshift-full360-first...

j / k navigate · click thread line to collapse