- When you want to compute metrics for multiple intervals (hour / day / month / etc.), Redis' MULTI / EXEC constructs make transactional updates to multiple keys a snap. Additionally, batching (supported by most Redis clients) can dramatically improve performance.
- You can use Redis sets for computing uniques in realtime. You can also use set operations like SUNION to compute uniques across multiple time periods relatively quickly. For example, SUNION the 24 hourly sets to get the total uniques for the day. Just be careful that large numbers of uniques eat up your available memory very quickly. EXPIREAT helps ensure things get cleaned up automatically.
- Using a Redis list as an event queue is a great way to further ensure atomicity. Use RPOPLPUSH to move events to an 'uncommitted' queue while processing a batch of events. If you have to roll back, just pop them back onto the original list.
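A minimal sketch of that reliable-queue pattern in plain Python, with deques standing in for Redis lists (against a real server you would use RPOPLPUSH to move each event and LPUSH to put it back on rollback; names here are my own):

```python
from collections import deque

def process_batch(events, handler):
    """Move each event to an 'uncommitted' queue before handling it
    (like RPOPLPUSH), so a failure mid-batch lets us push everything
    back onto the original list."""
    uncommitted = deque()
    try:
        while events:
            ev = events.pop()           # RPOP from the source list
            uncommitted.appendleft(ev)  # LPUSH onto the 'uncommitted' list
            handler(ev)
        uncommitted.clear()             # commit: drop the uncommitted list
    except Exception:
        while uncommitted:              # rollback: restore original order
            events.append(uncommitted.popleft())
        raise

events = deque(["e1", "e2", "e3"])
done = []
process_batch(events, done.append)      # consumes from the right: e3, e2, e1
```

If `handler` raises partway through, the `except` branch moves the in-flight events back, so nothing is lost.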
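The hourly-uniques idea can be sketched with plain Python sets standing in for Redis sets (in Redis this would be SADD into one key per hour, then SUNION across the hourly keys; the bucket names and visitor ids are invented for illustration):

```python
# One set of visitor ids per hourly bucket
# (in Redis: SADD uniques:2012-05-01:00 alice, etc.)
hourly = {
    "00": {"alice", "bob"},
    "01": {"bob", "carol"},
    "02": {"carol", "dave"},
}

# Daily uniques = union of the hourly sets
# (in Redis: SUNION over the 24 hourly keys)
daily = set().union(*hourly.values())
len(daily)  # 4 distinct visitors, even though there were 6 hourly entries
```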
I'll make sure I use batching first, and look into the unions technique after that.
"I've done implementations of the above using SQL databases (MySQL) and it wasn't fun at all. The storage mechanism is awkward - put all your values in a single table and have them keyed according to stats name and period. That makes querying for the data weird too. That is not a showstopper though - I could do it. The real problem is hitting your main DB a couple of times in a web request, and that is definitely a no-no."
This is not a SQL vs NOSQL issue: decoupling the reporting system from your main (production/transaction) system is a widely advised practice in "business intelligence".
Use a different instance, with a schema designed for reporting.
You can use Redis for that (and I use it actually!) but you can also use MySQL or any other RDBMS.
It's fairly easy to implement: one row for each fact, with foreign keys to a date dimension and an hour dimension (see [1]). Then you can sum over date ranges and hour ranges, drill down, etc., across many different metrics.
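A toy version of that fact/dimension layout, using SQLite for illustration (the table and column names are my own invention, not taken from the linked sample):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE hour_dim (hour_id INTEGER PRIMARY KEY, hour INTEGER);
    -- One row per fact, keyed to the date and hour dimensions
    CREATE TABLE pageview_facts (
        date_id INTEGER REFERENCES date_dim(date_id),
        hour_id INTEGER REFERENCES hour_dim(hour_id),
        views   INTEGER
    );
""")
conn.execute("INSERT INTO date_dim VALUES (1, '2012-05-01'), (2, '2012-05-02')")
conn.executemany("INSERT INTO hour_dim VALUES (?, ?)", [(h, h) for h in range(24)])
conn.executemany("INSERT INTO pageview_facts VALUES (?, ?, ?)",
                 [(1, 9, 100), (1, 10, 150), (2, 9, 80)])

# Sum the metric per day by joining facts to the date dimension
rows = conn.execute("""
    SELECT d.date, SUM(f.views)
    FROM pageview_facts f
    JOIN date_dim d USING (date_id)
    GROUP BY d.date ORDER BY d.date
""").fetchall()
```

Drilling down by hour is the same shape of query, just joined to `hour_dim` and grouped by both dimensions.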
[1] https://github.com/activewarehouse/activewarehouse-etl-sampl...
The pros are that it's very easy to set up (no schema definition, very practical API, easy to query); the cons are that you're limited by available memory (though, as you wrote, that's not an issue in your case) and that it's harder to build more elaborate reports.
But I use both techniques depending on the needs.
Thanks for taking the time to write this!
Rather than iterating over the entire list of series and checking for expired elements, you can use a sorted set and assign a time-based score. The cron job can still run once a day, but it can then find members of that sorted set whose scores fall below a certain threshold, which will almost certainly be faster.
Naturally this will increase memory usage (which may be undesired), but it's food for thought. Eventually the looping and trimming of expired hashes could be done with Lua server-side scripting in redis-2.6, which is interesting in a different way and has its own challenges.
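A sketch of that score-based expiry, with a plain dict standing in for the sorted set (in Redis: ZADD with the timestamp as the score when writing, then ZREMRANGEBYSCORE in the cron job; the key names are hypothetical):

```python
import time

# member -> score, like a Redis sorted set where score = insertion timestamp
zset = {}

def zadd(member, ts):
    zset[member] = ts

def expire_older_than(cutoff):
    """Like ZREMRANGEBYSCORE key -inf (cutoff): drop members whose
    score (timestamp) is below the threshold, without scanning payloads."""
    expired = [m for m, score in zset.items() if score < cutoff]
    for m in expired:
        del zset[m]
    return expired

now = time.time()
zadd("series:old", now - 90_000)              # written over a day ago
zadd("series:new", now - 600)                 # written ten minutes ago
expired = expire_older_than(now - 86_400)     # daily cron: drop anything older than 24h
```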
That makes it kind of difficult to use ZSETs for metrics unless you only care about uniques.
- https://github.com/antirez/redis-timeseries
- ZSET pull request https://github.com/antirez/redis-timeseries/pull/1
- and the resulting "dsl uptime" script I made out of it: https://github.com/thbar/dsl-uptime
Thinking about it, you could probably remove the old value and insert the new (incremented) one, but that seems both slower and non-atomic.
We use our analytics engine to show charts to our users as well. I can't do that with graphs generated from tools like Cacti/Ganglia/Graphite... As is the case with almost all sysadmin tools, they don't look too good.
If you're curious about it and have any questions, feel free to get in touch (email in profile).