It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.
We haven’t gotten fundamentally better recently; there seems to be an asymptote on how much you can really tell with a given amount of insight into the systems between source and destination.
The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.
That said, from an enthusiast perspective, running traceroute to the nearest Google service (1e100.net, for example) will already give you a big hint about where things went wrong.
One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: even when everything is functioning normally, most hops will show 0% loss, but some will show absolutely any value from 0% to 100%. The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, and the rest are all 0% loss. mtr does hostname lookup as well, which can be a little bit useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.
mtr: <https://en.wikipedia.org/wiki/MTR_(software)>
1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.
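If the short spelling looks surprising, here is a minimal Python sketch of how it expands. It assumes the platform's `inet_aton` accepts the classic two-part `a.b` form (glibc and the BSDs do), where the last part fills the remaining three bytes; the `expand` helper name is made up for illustration.

```python
import socket


def expand(short: str) -> str:
    """Expand a shorthand IPv4 address to dotted-quad form.

    Relies on the platform's inet_aton accepting forms with fewer
    than four parts (an assumption; true on common Unix systems).
    """
    return socket.inet_ntoa(socket.inet_aton(short))


if __name__ == "__main__":
    print(expand("1.1"))    # 1.0.0.1
    print(expand("127.1"))  # 127.0.0.1
```

This is the same parsing that lets `ping 1.1` or `mtr 1.1` work on those systems.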
You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).
Seeing packet loss in mtr is not entirely indicative of the health of the host. Some public servers filter out ICMP altogether, and others rate-limit the number of pings they reply to, via firewall rules or traffic shaping. You might be seeing that.
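One rough way to read per-hop numbers like the ten-hop example above: loss that is really happening on the forward path at some hop should also show up at every later hop, so loss that disappears further along is more likely that router declining to answer. The Python sketch below encodes that heuristic. It is an illustration only, not something mtr itself does; the function name is hypothetical and the sample percentages are approximations of the figures quoted above.

```python
def effective_loss(loss_by_hop):
    """Cap each hop's reported loss by the lowest loss seen at any later hop.

    Heuristic assumption: genuine forward-path loss at hop i also affects
    every hop after i, so anything above the later minimum is treated as
    ICMP filtering or rate limiting at that hop.
    """
    result = []
    floor = float("inf")
    # Walk from the destination back to the gateway, tracking the minimum.
    for loss in reversed(loss_by_hop):
        floor = min(floor, loss)
        result.append(floor)
    result.reverse()
    return result


if __name__ == "__main__":
    # Hops 1-10: hop 5 at 100%, hop 6 around 45%, hop 7 around 61%, rest 0%.
    reported = [0, 0, 0, 0, 100, 45, 61, 0, 0, 0]
    print(effective_loss(reported))
```

With those inputs every hop comes out at 0%, which matches the observation that the path is healthy. The heuristic can still mislead, for example when the destination itself rate-limits replies, so treat it as a first filter rather than a verdict.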