undefined | Better HN

0 pointsmalisper9y ago0 comments

There's two parts to it. We first split up data by different customers. At some point customers get big enough that having a single table is slow. Once a customer reaches a certain size we split up the end users into ranges and have separate tables for each customer, range pair. W typically limit each table to 800k events which is about 800MB of data. Then when we query, we use citus which automatically sends the proper queries to the specific tables necessary and then aggregates the results. Each individual table in our cluster is sorted by time.

0 comments

ngrilly9y ago

Thanks Michael for clearing my doubts :-)

One last question: I guess most queries target a time range. Do you use BRIN indexes to avoid scanning the whole 800 MB of data in each shard, and just read the necessary pages?

malisperOP9y ago

That get's into our indexing strategy which Dan talks about in this talk[0]. Currently our tables aren't completely insert only, so a BRIN index wouldn't work for us, as one row in the wrong place can cause a huge amount of extra reads.

[0] https://www.youtube.com/watch?v=NVl9_6J1G60

ngrilly9y ago

No more question :-) Thanks Michael for you time, and for the link to Dan's talk. It's great to see how, at Heap, you scaled a multitenant SaaS based on PostgreSQL!

j / k navigate · click thread line to collapse

0 comments

ngrilly9y ago

Thanks Michael for clearing my doubts :-)

One last question: I guess most queries target a time range. Do you use BRIN indexes to avoid scanning the whole 800 MB of data in each shard, and just read the necessary pages?

malisperOP9y ago

[0] https://www.youtube.com/watch?v=NVl9_6J1G60

ngrilly9y ago

No more question :-) Thanks Michael for you time, and for the link to Dan's talk. It's great to see how, at Heap, you scaled a multitenant SaaS based on PostgreSQL!

j / k navigate · click thread line to collapse