Skip to content

Top New Best Ask Show Jobs

Scaling Analytics at Amplitude | Better HN

Scaling Analytics at Amplitude (opens in new tab)

(amplitude.com)

67 pointsblader10y ago33 comments

33 comments

jallmann10y ago

What shortcomings of Redis set operations does the in-memory data store address, and how?

Unrelated rant: regardless of its merits, "Lambda" Architecture is probably the most annoying overloaded term in use today, second only to "Isomorphic" Javascript. Just because something has a passing resemblance to the functional style doesn't grant license to re-appropriate a well understood term of art.

paladin31415910y ago

Redis is a great piece of software, and we leverage it for several uses cases outside of managing sets. For our use case, there were a couple of blockers that prevented Redis from being a viable solution:

1. It's tricky to scale out a Redis node when it gets too big. Because RDB files are just a single dump of all data, it's not easy to make a specific partitioning of the dataset. This was a very important requirement for us in order to ease scaling (redis-cluster wasn't ready yet -- we've been following that carefully).

2. When you store hundreds of GB of persistent data in Redis, the startup process can be very slow (restoring from RDB/AOF). Since it can't serve reads or writes during this time, you're unavailable (setting up a slave worsens the following problem).

3. The per-key overhead in Redis (http://stackoverflow.com/questions/10004565/redis-10x-more-m...). We have many billions of sets that are often only a few elements in size -- think of slicing data by city or device type -- which means that the resulting overhead can be larger than the dataset itself.

If you think about these problems upfront, they're not too difficult to solve for a specific use case (partition data on disk, allow reads from disk on startup), but Redis has to be generic and so can't leverage the optimizations we made.

chupy10y ago

Hello Jeffrey, First I wanted to say that your post is very nicely written and full of juicy details! :)

Regarding the sets database, I had to solve quite a similar problem at the company where I work and instead of sets I actually chose to use the Redis HypeLogLog structure instead of sets because for near real time results you just need an approximate count of the sets / or their intersection and you don't need to know the specific set members. I just wanted to let you know that it works great for us for with doing intersections (PFMERGE) on sets containing hundreds of millions of members. If anybody is interested I can do a writeup about it.

Did you ever consider using that?

daddykotex10y ago

I also wonder what sacrifices were made to the design used in Redis so that it is able to handle better performance.

angryasian10y ago

Out of curiosity why weren't products like Druid http://druid.io/ or influxdb https://influxdb.com/ or possibly opentsdb taken into consideration ?

paladin31415910y ago

To be totally honest, there are so many technologies out there that claim to solve analytics that it's tough to seriously consider all of them.

That said, we have looked at Druid, which is also a good example of using lambda architecture in practice (http://druid.io/docs/0.8.0/design/design.html -- note the historical vs realtime distinction). They use many of the same design principles as us, and one of our sub-systems is very similar to it. We still believe the pre-aggregation approach is critical for performance in our use case, though. Lastly, when we started building the architecture (mid-2014), Druid was very new, and I'm generally wary of designing everything around a new and potentially unstable piece of software.

fangjin10y ago

Druid does pre-aggregation (roll-up) of data at ingestion time and is also used at scale (30+ trillion events, ingesting over 1M+ events/s) by numerous large technology companies: http://druid.io/druid-powered.html

angryasian10y ago

Just to note, with druid you are able to have preaggregated tables based on dimensions. Overall good article and thanks for sharing.

yresnob10y ago

"Finally, at query time, we bring together the real-time views from the set database and the batch views from S3 to compute the result"

so how in the heck does this work? at query time you decide what file to get our of s3 (hwo do u decide this?), parse it, filter it, and merge with the results from the custom made Redis like real time database?

paladin31415910y ago

The files in S3 are pre-aggregated results keyed by how we fetch them (e.g. there will be a file containing all of the users active on a particular day). What you've described is a pretty accurate description of what happens :)

sskates10y ago

We'll be sharing more about our query architecture in the future as well as other parts of the stack that we haven't included here. The query layer is an impressive piece of architecture that handles fast access to multiple distributed data stores.

paladin31415910y ago

Author of the post here. Happy to talk about how we've designed/built our architecture at Amplitude!

tinco10y ago

How did you guys split the databases per customer? Is it all one big stream of data for you or does it get split at a pretty early level? Is data of multiple customers in every database or do you maintain a cluster per customer?

paladin31415910y ago

Most of our databases are multi-tenant, so a single cluster will handle all customer data. The exception is Redshift, which has a separate cluster for each customer since we allow them to have direct access to it (https://amplitude.com/blog/2015/06/05/optimizing-redshift-pe...).

ddorian4310y ago

You could store the sets in postgresql arrays(to remove row overhead) (1GB maximum field size) and build some efficient union,intersect functions so you wouldn't have to unnest?

paladin31415910y ago

We tried a variety of PostgreSQL-based approaches, including this one. Unfortunately, the way you do set insertions using arrays is to do a O(n) membership check, which means your set will take O(n^2) to construct -- very inefficient.

optimusclimb10y ago

Did you use Camus for ETL, and if so, did you have to modify it to work with S3?

paladin31415910y ago

We don't use Camus; IIRC, it didn't exist at the time that we built most of the infrastructure. We just read data directly out of Kafka using client libraries.

ecesena10y ago

> in-memory database holds only a limited set of data

MemSQL is not just in-memory, but also has column-store (note: I don't know VoltDB). You can think of MemSQL not as "does everything in-memory", but "uses memory at the best".

msaspence10y ago

How do you decide what sets of users you pre aggregate?

It seems like without some limits in place you could end up with huge number of sets, especially if you are calculating these based on event properties.

paladin31415910y ago

That's a great observation. Somewhere along the spectrum of query flexibility you reach a point where pre-aggregation doesn't work anymore. We have a separate column-store based system in place for certain types of queries which we'll almost certainly blog about in the future!

msaspence10y ago

How does that then affect your cost efficiency. What clever things are you doing with this to keep your costs so low?

I guess we'll find out in a future post.

hopel10y ago

Do you store raw data ingested from Kafka directly in S3 or have an intermediate database for hot data?

paladin31415910y ago

The raw data is stored directly into S3 every hour, but there are multiple systems that process the data directly from Kafka in order to produce the real-time pre-aggregated views.

msaspence10y ago

Its easy to see how you can calculate segments, retention and trends using user sets, but how do you calculate funnels.

jchrisa10y ago

The lambda architecture and the split between the heavy slow processing and the interactive processing reminds me of how a few of our customers are blending Hadoop and Couchbase for similar use cases: http://www.couchbase.com/fr/ad_platforms

j / k navigate · click thread line to collapse