Open-Sourcing Rearview: Real-Time Monitoring With Graphite

(techblog.livingsocial.com)

48 points by wastedbrains 12y ago | 16 comments

foz 12y ago

At my company we've been using Graphite and StatsD for nearly two years now, and we rely on them heavily for tracking performance and troubleshooting issues. We rely on Icinga, Pingdom, NewRelic and other tools to alert us to problems.

Often, when things have gone really wrong (DoS, internal network issues, app errors, disk full), the affected machine(s) stop reporting to Graphite (or under-report data). We get alerted by monitoring the services, not the stats.

Being alerted about low or unusual values might be helpful in some cases, but based on my experience, it would be too noisy. Usually when something bad happens, we investigate Graphite and analytics tools anyway to understand the impact on traffic and KPIs.

I could see Rearview being useful for some cases, but not as a replacement for real monitoring and alerting tools.

sakers 12y ago

We use NewRelic and Pingdom as well. Where Rearview really shines is in creating monitors like these:

1) Control charts that alert when a process deviates more than 3 standard deviations above or below the mean of historical data (e.g. purchases/logins are lower than expected, process failures are higher than expected, etc.).

2) Deployment-triggered monitors that automatically analyze data before and after a deploy for shifts in mean or increases in variance (e.g. do we see more login failures after this deploy, do we see more 4xx/5xx responses, did page load time increase, etc.).

3) Response time monitors. While this seems straightforward enough, Rearview can not only tell you when a service or page response time has exceeded some statistical limit, it can also present you with more information regarding causes (e.g. this process is slow because of an issue with the database, redis, a dependent process/service, etc.).

4) SPAN as a means of monitoring load time or response time. SPAN is the 95th percentile minus the 5th percentile, and it gives a much more accurate representation of what users experience than the mean or median.

5) Process efficiency checks that make sure processes complete on time and execute the expected number of commands (e.g. sent email, updated databases, etc.).

And many more. Basically you are only limited by your imagination and coding skills. Of course the other benefit is in performing similar monitoring on business metrics and not just application performance (e.g. is the funnel performing as expected/needed, are our customer tools being used on a regular basis, are our marketing campaigns paying off, etc.).
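
As a rough illustration of points 1 and 4 above, here is a minimal sketch in Python (Rearview's own monitors are written in Ruby, and none of these function names come from Rearview; they are hypothetical). It assumes the historical samples are already available as a plain list of numbers, uses the sample standard deviation for the control limits, and uses a simple nearest-rank percentile for SPAN, which may differ from whatever percentile method Rearview or Graphite actually applies.

```python
import math
import statistics

# Hypothetical helper names for illustration only; not Rearview's API.

def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits: mean -/+ `sigmas` sample standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation, needs >= 2 points
    return mean - sigmas * stdev, mean + sigmas * stdev

def out_of_control(value, history, sigmas=3.0):
    """True when `value` falls outside the control limits derived from `history`."""
    lower, upper = control_limits(history, sigmas)
    return value < lower or value > upper

def percentile(samples, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def span(samples):
    """SPAN as described in the comment: 95th percentile minus 5th percentile."""
    return percentile(samples, 95) - percentile(samples, 5)

if __name__ == "__main__":
    logins_per_minute = [100, 102, 98, 101, 99, 100, 103, 97]  # made-up history
    print(control_limits(logins_per_minute))
    print(out_of_control(93, logins_per_minute))
    print(span(list(range(1, 101))))
```

A real monitor would presumably pull `history` from Graphite over a sliding window and decide how to treat missing datapoints before computing the limits.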

SEJeff 12y ago

In my currently non-existent free time, I'm a Graphite co-maintainer (check GitHub). If you have any improvements or suggestions, please feel free to send us pull requests. The current pull requests are a bit of a mess, but I blame myself and will be getting around to merging a ton of them "real soon now TM".

foz 12y ago

Thank you for your work on Graphite. For all its UI strangeness and quirks, it is a great solution that a lot of people love (myself included).

I'll peek at the pull requests and see if my company might be able to contribute some help.

talbright 12y ago

Rearview complements these services, and is not intended as a replacement for them. While there is overlap, the scope is different.

Pingdom will tell you that your engine just threw a rod. Rearview will tell you your rods are knocking before that happens.

jwatte 12y ago

Server-side graphs didn't work out for all our monitoring use, so we don't use Graphite. You should make a version that works with istatd :-) https://github.com/imvu-open/istatd

sakers 12y ago

I'll definitely check that out! I'm also thinking we may need to add support for a time-series database as Graphite does have its limitations.

dekz 12y ago

This looks really polished and is definitely a great idea. I can see why you chose Ruby for scripting the monitors: being able to evaluate that code in a predefined binding can be quite powerful, especially with helpers pre-defined as well.

Why not a full Ruby stack, or was the "live" scripting done after the initial inception?

sakers 12y ago

We have always used Ruby for the scripting (we're predominantly a Ruby shop, so this was key for future adoption). The very first MVP for this tool was individual Ruby scripts running against Graphite and being scheduled via cron. The first real backend scheduler was built in Scala, but for various reasons we've converted to Rails/Puma/Celluloid running in a VM using JRuby. The monitors themselves run in an MRI sandbox for security purposes.
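
To give a feel for what a cron-scheduled script "running against Graphite" might look like, here is a minimal sketch in Python rather than Ruby. It is not the original MVP code. It assumes the JSON output of Graphite's render API (`format=json`), which is a list of series, each with a `target` name and a `datapoints` list of `[value, timestamp]` pairs where a missing value is `null`. The metric name, threshold, and function names are invented for the example, and the payload is inlined so the sketch runs without a Graphite server.

```python
import json

# Hypothetical names and data for illustration; not taken from Rearview.

def latest_value(render_json, target):
    """Return the most recent non-null value for `target`, or None if absent."""
    for series in json.loads(render_json):
        if series["target"] == target:
            values = [v for v, _ts in series["datapoints"] if v is not None]
            return values[-1] if values else None
    return None

def check(render_json, target, limit):
    """Classify the latest value as 'ok', 'alert', or 'no data'."""
    value = latest_value(render_json, target)
    if value is None:
        return "no data"  # e.g. the host stopped reporting
    return "alert" if value > limit else "ok"

if __name__ == "__main__":
    # Stand-in for the body returned by a render request with format=json.
    payload = json.dumps([
        {"target": "app.login.failures",
         "datapoints": [[3.0, 1383000000], [120.0, 1383000060], [None, 1383000120]]}
    ])
    print(check(payload, "app.login.failures", limit=100))
```

The explicit "no data" result matters here because, as noted earlier in the thread, machines in trouble often stop reporting to Graphite altogether.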

fit2rule 12y ago

I'm not sure I'm ready to abandon a custom monitoring environment consisting of a shell environment, screen, ssh certs, lugubrious quantities of /proc/, and a fair bit of gnuplot. Seems to me that's all you need? Why commit to a Ruby install for an operator console?

sakers 12y ago

I'm not sure lugubrious means what you intended it to mean. :) At any rate, see my reply here https://news.ycombinator.com/item?id=6646402 for a sampling of things Rearview brings to the table. The tl;dr is that it's not a NOC tool, it's more for process monitoring whether that be application processes, engineering processes, or business processes. It also does provide a central location for anyone to see the state and history of an application or business unit.

fit2rule 12y ago

>The tl;dr is that it's not a NOC tool, it's more for process monitoring whether that be application processes, engineering processes, or business processes. It also does provide a central location for anyone to see the state and history of an application or business unit.

Ah. I've usually just used email for that. :)

talbright 12y ago

Actually, Rearview started out similar to that. We wanted something more accessible to our large engineering team. It has all the advantages of a web user interface coupled with the powerful capabilities you get with a scripting language.

mh- 12y ago

this looks fantastic. thanks for releasing it.

the UI is quite polished

sakers 12y ago

You're welcome! We did a second feature release of our Ruby version today that has even more UI goodness. Basically we've added the ability to group categories of monitors under one dashboard. You can then switch between categories using carousel controls or directly from a drop-down. We're hoping to open source this version soon, and we're crossing our fingers that the Ruby version will see more collaboration from outside developers.

mh- 12y ago

sounds really cool. looking forward to trying this out
