One thing of note in the graph is the tracking of response size. This would be very useful for 200 responses with "Error" in the text, because then the response size would drop drastically below a normal successful response payload size.
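For what it's worth, a minimal sketch of tracking response size as a metric, using the Python prometheus_client library (the metric name, buckets, and labels are my own illustrative choices, not anything from the article):

    # Illustrative sketch: record response body size labelled by status code, so
    # that "200 OK" responses whose bodies are actually error messages show up
    # as a sudden drop in observed payload size.
    from prometheus_client import Histogram

    RESPONSE_SIZE_BYTES = Histogram(
        "http_response_size_bytes",
        "Size of HTTP response bodies",
        ["handler", "code"],
        buckets=(64, 256, 1024, 4096, 16384, 65536, 262144),
    )

    def record_response(handler, code, body):
        # A tiny "Error" body under a 200 lands in the smallest buckets, well
        # below a normal successful payload.
        RESPONSE_SIZE_BYTES.labels(handler=handler, code=str(code)).observe(len(body))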
In addition to Latency, Error Rates, Throughput, and Saturation, folks like Brendan Gregg @ Netflix have recommended tracking capacity.
(bias alert - I work on Honeycomb)
I agree with other comments, though the devil is in the details of how to actually set up these "golden signals" so that they are useful and don't just drown everyone in packet-level nonsense.
The other way of approaching it is to look for the additional latency it causes, which you can spot on a per-service basis.
I recommend against this; rather, have one overall duration metric and another metric tracking a count of failures.
The reason for this is that very often just the success latency will end up being graphed, and high overall latency due to timing-out failed requests will be missed.
The more information you put on a dashboard, the more chance someone will miss a subtlety like this in the interpretation. Particularly if debugging distributed systems isn't their forte, or they've been woken up in the middle of the night by a page.
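To make the suggestion concrete, here is a minimal sketch using the Python prometheus_client library; the metric names and the do_work handler are my own illustrative stand-ins, not anything prescribed above:

    from prometheus_client import Counter, Histogram
    import time

    # One overall duration metric covering every request, plus a separate
    # failure counter, rather than a duration series split by outcome.
    REQUEST_DURATION = Histogram(
        "request_duration_seconds",
        "Time spent serving requests, successes and failures alike",
    )
    REQUEST_FAILURES = Counter(
        "request_failures_total",
        "Requests that ended in an error",
    )

    def handle(request):
        start = time.time()
        try:
            return do_work(request)  # hypothetical handler
        except Exception:
            REQUEST_FAILURES.inc()
            raise
        finally:
            # Observed for every request, so slow, timing-out failures still
            # drag the overall latency up instead of hiding behind a
            # success-only series.
            REQUEST_DURATION.observe(time.time() - start)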
This guide only covers what I'd consider online serving systems; I'd suggest a look at the Prometheus instrumentation guidelines for what sort of things to monitor in other types of systems: https://prometheus.io/docs/practices/instrumentation/
The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation.
I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.
(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)
This is a place where I think you guys could beat what other 3rd-party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make onboarding new teammates a sensible experience, we need both a layer of alerts on top of the charts and a set of rules of thumb (which should be programmed, if the alerting system were good enough) that put the alerts together into realistic failure cases: if X and Y triggered, but Z didn't, then chances are this piece is the culprit.
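As a thought experiment, that kind of programmed rules-of-thumb layer might look something like the following sketch (the alert names and the rule itself are invented purely for illustration):

    # Hypothetical "rules of thumb" layer: combine individual alert states into
    # a likely-culprit diagnosis, instead of leaving the correlation in
    # people's heads.
    def likely_culprit(firing):
        """firing: set of alert names currently triggered."""
        # "If X and Y triggered, but Z didn't, this piece is probably the culprit."
        if {"queue_depth_high", "consumer_lag_high"} <= firing and "db_latency_high" not in firing:
            return "consumer pool"
        if {"db_latency_high", "queue_depth_high"} <= firing:
            return "database"
        return None

    print(likely_culprit({"queue_depth_high", "consumer_lag_high"}))  # -> consumer pool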
There are also opportunities in visualizations that aren't chart-based. We used to have something like that for another complex system at another employer, but that's expensive, custom work, unless you join forces with something that understands where all your services are, knows all ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation. So leave that until you merge with SkylinerHQ or something.
That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying statsd/Splunk divide of older monitoring would probably make us buy it already.
Indeed. The first order issue is locating the problem though.
If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).
> unless you feed them into a system that can natively tease them apart as easily as show them together
And the user actually thinks to perform that additional analysis.
> So it seems like the disagreement is more about visualization than collection
What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.
Thus I suggest removing the risk on the visualisation end by eliminating the problem at the collection stage. This is particularly important when the people doing the visualisation aren't the same people writing the collection code, and thus you don't know whether the people creating the dashboards will all be sufficiently operationally sophisticated.
> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.
It's a little messier than that. Depending on exactly how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need the overall latency increase in order to see whether this server is entirely responsible for the overall latency increase I see one level up in the stack, or whether there's some other or additional problem that needs explanation. There's no equivalent math for the success/failure split.
Put another way, the math on the overall metric works the way your intuition thinks it does. The split-out version is more subtle.
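A small, made-up example of the point: with one overall series per service, sum / count composes cleanly across levels of the stack, whereas a success/failure split has to be recombined first, and a naive average of averages gives the wrong answer (all numbers below are invented):

    # Illustrative numbers only. One overall series per service: average
    # latency is simply sum / count, and a downstream increase compares
    # directly against what's seen one level up.
    downstream = {"sum_seconds": 120.0, "count": 1000}
    upstream = {"sum_seconds": 180.0, "count": 1000}
    downstream_avg = downstream["sum_seconds"] / downstream["count"]  # 0.12 s
    upstream_avg = upstream["sum_seconds"] / upstream["count"]        # 0.18 s

    # With success/failure split out, the series must be recombined before the
    # same comparison is valid; averaging the two averages is wrong whenever
    # the request counts differ.
    ok = {"sum_seconds": 30.0, "count": 900}
    err = {"sum_seconds": 90.0, "count": 100}
    combined_avg = (ok["sum_seconds"] + err["sum_seconds"]) / (ok["count"] + err["count"])  # 0.12 s
    naive_avg = (ok["sum_seconds"] / ok["count"] + err["sum_seconds"] / err["count"]) / 2   # ~0.47 s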
>(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)
I work on Prometheus which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).
The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.
The way I see it, you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.
I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/
What exactly would that be?
You should only have simple dashboards (and alerting) for KPIs and end-to-end checks. Everything else should be instrumented and debugged using a real-time sorting and slicing tool, especially if you have a complex system (microservices, distributed systems, polyglot persistence).
Counting incoming and outgoing requests misses a lot of potential data points when determining "is this my fault?"
I work mainly in system integrations. If I check the ratio of input:output, then I may miss that some service providers return a 200 with a body of "<message>Error</message>".
A better approach is to make sure your systems are knowledgeable about how data is received from downstream submissions, and to have a universal way of translating that feedback into a format your own service understands.
HTTP codes are (pretty much) universal. But let's say you forgot to include a header, forgot to base64-encode login details, or are simply using the wrong value for an API key. If your system knows that "this XML element means Y for provider X, and means Z in our own system", then you can better gauge issues as they come up, instead of waiting for customers to complain. This is also where tools like Splunk are handy, so you can be alerted to these kinds of errors as they come up.
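As a rough illustration of that translation layer, a sketch in Python (the provider name, XML shape, and normalisation rules are invented for the example):

    # Hypothetical sketch: normalise provider-specific "soft errors" (a 200
    # whose body says Error) into a status your own service understands, so
    # they can be counted and alerted on like any other failure.
    import xml.etree.ElementTree as ET

    def normalise(provider, http_code, body):
        """Return 'ok' or 'error' regardless of how the provider reports failure."""
        if http_code >= 400:
            return "error"
        if provider == "provider_x":
            # Provider X signals failure with a 200 and <message>Error</message>.
            root = ET.fromstring(body)
            text = root.text if root.tag == "message" else root.findtext("message")
            if text == "Error":
                return "error"
        return "ok"

    assert normalise("provider_x", 200, "<message>Error</message>") == "error"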
If the "things calling you" can't be effectively throttled, you often run into issues like, for example, hitting the limit on number of open sockets, file descriptors, receive queue, threads etc.
So, just saying "the downstream service is at fault" isn't really correct. Your service may also not be acting correctly in that situation. Those issues can also affect your logging and metrics.
It's not a trivial exercise to architect your service such that it always does the right thing (throttling input vs retries vs fail fast vs priority queues vs load balancing to multiple instances of a downstream service, exponential backoff, etc) when a downstream dependency is slow and/or down.
Edit: Similar to your observation about structured errors, connection pooling is probably also worth talking about in this situation, which would change the stats you want... once the # of connections made isn't the same thing as the # of transactions, you would want to know both.
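For one of the techniques listed above, here's a minimal sketch of retries with exponential backoff that fails fast once a deadline is spent (function names and limits are assumptions, not a prescription):

    import random
    import time

    def call_with_backoff(request_fn, attempts=4, base_delay=0.1, deadline_s=2.0):
        """Retry request_fn on timeouts, backing off exponentially with jitter,
        and fail fast once the overall deadline budget is spent."""
        start = time.monotonic()
        for attempt in range(attempts):
            try:
                return request_fn()
            except TimeoutError:
                elapsed = time.monotonic() - start
                if attempt == attempts - 1 or elapsed >= deadline_s:
                    # Fail fast rather than piling retries onto a sick dependency.
                    raise
                # Jitter keeps a fleet of callers from retrying in lockstep.
                time.sleep(min(base_delay * (2 ** attempt) + random.random() * base_delay,
                               deadline_s - elapsed))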
This is so much easier said than done. Most time-series databases that people use to instrument things quite simply cannot handle histogram data correctly. They make incorrect assumptions about the way roll-ups can happen, or they require you to be specific about resolution requirements before you can know them well.
Then histogram data tends to be very expensive to query, so it bogs down and prevents you from making the kinds of queries that are really valuable for diagnosing performance regressions.
Finally, visualizing histograms is really difficult because you need a third dimension to see them over time. Heat maps accomplish this but are hard to read at times, and most dashboard systems don't have great visualization options for "show this time period next to this time period", which is an incredibly common requirement when comparing latency histograms.
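To make the roll-up pitfall concrete, a tiny sketch with invented numbers: bucket counts can be summed across hosts, but a percentile cannot be recovered by averaging each host's precomputed percentile, which is effectively what some roll-up schemes assume:

    from collections import Counter

    # Bucket upper bound (seconds) -> request count, per host. Numbers invented.
    host_a = Counter({0.01: 900, 0.1: 90, 1.0: 10})
    host_b = Counter({0.01: 100, 0.1: 100, 1.0: 800})

    merged = host_a + host_b  # correct roll-up: add bucket counts

    def p99(buckets):
        total = sum(buckets.values())
        seen = 0
        for bound in sorted(buckets):
            seen += buckets[bound]
            if seen >= 0.99 * total:
                return bound

    print(p99(merged))                       # 1.0: the real p99 of the combined data
    print((p99(host_a) + p99(host_b)) / 2)   # 0.55: averaging per-host p99s is wrong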
We don't have the visualizations for histograms yet (though you can chart specific percentiles), but for the reasons you mention, Honeycomb is perfectly suited to give you that kind of data. I can't say we'll get that out the door soon, but it's one of my most-wanted pet features, so as soon as I can convince myself it's actually more important than the mountain of other things that need to get done, you'll get your histograms and your time-over-time comparisons.
I've been advocating for a heat map style presentation of histograms for a long time, but I hadn't considered the difficulty that creates when trying to show time over time. That's an interesting one to noodle on.
Thanks for articulating well the value and reasons for difficulty in implementing histograms!
(bias alert - I work on Honeycomb)
Is this normal usage? Seems reversed to me.
To rephrase that, "upstream" means "where events come from".
If the river is data, then stuff you depend on is upstream from you, and things that depend on you are downstream.