Ask HN: What observability efforts have had the best impact on your company?

11 points | jturolla | 7y ago | 3 comments

I'm currently leading an observability team at a 150+ engineer, 150+ services company. We spent the last 6 months building infrastructure around Prometheus, including instrumentation, alerting rules, high availability and scalability/sharding. We have a multi-Kubernetes-cluster setup and multiple customer shards, with a high growth rate.

When we started, our observability stack was composed of Riemann, Splunk and thousands of unloved CloudWatch alerts. We felt Riemann was very difficult to work with for our setup and our multiple shards, so we evaluated the options we had and finally chose Prometheus.

We spend a lot of time pairing with squads so they can learn Prometheus and create instrumentation and alerts on their own.

We're in the middle of the Prometheus rollout. Roughly 50% of the services and squads are using metrics and alerts, and this is on track to reach 90% by the end of next quarter.

Having a metrics and alerting platform ready to use is making engineers much more confident in their services. However, some incidents are still reported by customer support rather than caught by engineering tools, which could mean either that we're missing something or that we just have to double down on the rollout to make sure everything is covered and alerting correctly.

Could you share your thoughts and experience around observability?

3 comments

wjossey | 7y ago

One thing I'd encourage is to build a culture of looking at graphs daily as a matter of practice. Alerts are only as effective as the knowledge of the writer, and human-based detection is almost always the first step to better alerting.

By looking at graphs on a daily basis, you get much faster at diagnosing issues and at understanding the root cause of the underlying problem (assuming you're measuring whatever broke). So many times I was able to look at a graph and go, "Something isn't right", long before an alert would go off. I wasn't magical; I had just spent a little bit of time every day reviewing graphs, so I could pick out subtle changes across a handful of them very quickly. Those changes would also have been hard to build checks around pre-emptively.

One additional thought is to not over-alert. Alert fatigue is your greatest enemy, and needs to be avoided at all costs. If you're not reacting to an alert, you should question whether or not you need it.
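One rough way to put that last question into practice is to review how often each alert was actually acted on. The sketch below is only an illustration: the `(alert_name, was_actioned)` history format, the alert names, and both thresholds are assumptions, not something any particular tool exports.

```python
from collections import Counter


def unactioned_alerts(firings, min_firings=5, max_action_rate=0.2):
    """Return names of alerts that fire often but are rarely acted on.

    `firings` is an iterable of (alert_name, was_actioned) pairs.
    Both thresholds are arbitrary starting points, not recommendations.
    """
    fired = Counter()
    acted = Counter()
    for name, was_actioned in firings:
        fired[name] += 1
        if was_actioned:
            acted[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_firings and acted[name] / count <= max_action_rate
    )


# Hypothetical alert history: HighCPU fired 10 times and was acted on once.
history = (
    [("HighCPU", False)] * 9 + [("HighCPU", True)]
    + [("CheckoutErrors", True)] * 4 + [("CheckoutErrors", False)]
    + [("DiskAlmostFull", False)] * 2
)
print(unactioned_alerts(history))  # ['HighCPU']
```

Anything this flags is a candidate for review, not automatic deletion; an alert can be rare to act on and still be worth keeping.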

jturolla (OP) | 7y ago

> Alerts are only as effective as the knowledge of the writer

We're enforcing that every alert has a link to a playbook with more details about the issue and how to mitigate it.
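A check like this could run in CI. The following is a minimal sketch, not our actual tooling: it assumes the rule file has already been parsed into a dict with the usual Prometheus `groups` / `rules` / `annotations` layout, and the annotation name `playbook_url`, the alert names, and the example URL are all hypothetical.

```python
def rules_missing_playbook(rule_file, annotation="playbook_url"):
    """Return names of alerting rules that lack a playbook link.

    `rule_file` is a parsed Prometheus rule file (a dict).
    `annotation` is an assumed annotation name; use whatever your team picked.
    """
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no alert name, so skip them
            annotations = rule.get("annotations") or {}
            link = str(annotations.get(annotation) or "")
            if not link.startswith(("http://", "https://")):
                missing.append(rule["alert"])
    return missing


# Hypothetical rule file, as it would look after loading the YAML.
rule_file = {
    "groups": [
        {
            "name": "checkout",
            "rules": [
                {
                    "alert": "CheckoutErrorRateHigh",
                    "expr": 'rate(http_requests_total{status=~"5.."}[5m]) > 1',
                    "annotations": {
                        "playbook_url": "https://wiki.example.com/playbooks/checkout-errors"
                    },
                },
                {
                    "alert": "CheckoutLatencyHigh",
                    "expr": "checkout_latency_seconds > 2",
                    "annotations": {"summary": "Checkout latency is high"},
                },
                {
                    "record": "job:http_requests:rate5m",
                    "expr": "sum by (job) (rate(http_requests_total[5m]))",
                },
            ],
        }
    ]
}
print(rules_missing_playbook(rule_file))  # ['CheckoutLatencyHigh']
```

This only checks that a link is present and looks like a URL; it does not verify that the playbook page exists or is up to date.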

About over-alerting: we were already over-alerting before we began the taskforce, and one of our efforts is removing CloudWatch and Riemann alerts that are usually noisy and offer little insight into the real issues.

troycarlson | 7y ago

Any suggestions for getting graphs on a TV in the office? I usually have monitoring tools like New Relic open in my browser, but I'd like to have these aggregated on a wall-mounted TV for the team to look at together.
