undefined | Better HN

0 pointsmalisper9y ago0 comments

Most of our queries depend more on the time of the events rather than the user the events belong to. For example, let's say you want to know how many users signed up on Monday and logged in again before Friday. That query would fetch all sign up events and all log in events over the rest of the week, do a group by user_id, and use a custom udf to perform the aggregation. We never actually fetch multiple events from a user at a time. Instead we look for specific types of events in a given time range and group by the user. Clustering by time winds up being a much bigger win (benchmarks showed 10x compared to sorted by user for some uncached queries) here as almost all of our queries are constrained to a given time period.

Currently, maintaining the clustering has only been best-effort. We sort our data whenever we copy it from one location to another and the data comes in sorted by time, so it's fairly easy to maintain a high row correlation with time.

0 comments

ngrilly9y ago

My question was probably not clear enough... I'm asking about clustering by customers/tenants (i.e. Heap customers), not by users (i.e. the users of Heap customers).

drob9y ago

The data is sharded by customer and then sub-sharded by end user within the customer. For all but the tiny customers, 100% of the data on a logical shard will belong to the same customer. That means our subqueries will never touch data from more than one customer unless the customer is very small. (And, if the customer is that small, it should be easy to make the query fast anyway.)

ngrilly9y ago

Your answer is very useful. Thanks Dan!

May I ask how many logical shards do you have per physical shard/machine? And what is the average size of a logical shard on disk?

You wrote "the data is sharded by customer and then sub-sharded by end user within the customer", but malisper wrote above that "clustering by time winds up being a much bigger win". Isn't it contradictory?

malisperOP9y ago

There's two parts to it. We first split up data by different customers. At some point customers get big enough that having a single table is slow. Once a customer reaches a certain size we split up the end users into ranges and have separate tables for each customer, range pair. W typically limit each table to 800k events which is about 800MB of data. Then when we query, we use citus which automatically sends the proper queries to the specific tables necessary and then aggregates the results. Each individual table in our cluster is sorted by time.

1 more reply

j / k navigate · click thread line to collapse

0 comments

ngrilly9y ago

My question was probably not clear enough... I'm asking about clustering by customers/tenants (i.e. Heap customers), not by users (i.e. the users of Heap customers).

drob9y ago

ngrilly9y ago

Your answer is very useful. Thanks Dan!

May I ask how many logical shards do you have per physical shard/machine? And what is the average size of a logical shard on disk?

malisperOP9y ago

1 more reply

j / k navigate · click thread line to collapse