Do monitoring tools still miss early signals before incidents?
I'm curious how teams detect early signals of infrastructure problems before they turn into incidents.
In some environments I've seen alerts fire only after the system is already degrading.
Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing
Tools like Datadog, Zabbix, and Prometheus are great for alerting, but they still feel mostly reactive.
How do you deal with this in your infrastructure?
Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?
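To make "predictive monitoring" concrete for the disk case, here is a rough sketch of the kind of check I have in mind: fit a straight line to recent usage samples and estimate how long until the disk is full, so an alert can fire on the trend rather than on a fixed threshold. This is similar in spirit to Prometheus's `predict_linear` function. The function name `seconds_until_full` and the sample values are made up for illustration, and a linear fit is only an assumption that will not hold for bursty growth.

```python
from typing import Optional, Sequence


def seconds_until_full(
    timestamps: Sequence[float],
    used_fraction: Sequence[float],
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches 100%.

    Uses an ordinary least-squares line through (timestamp, used_fraction)
    pairs. Returns None when usage is flat or shrinking.
    Hypothetical helper, not part of any monitoring tool.
    """
    n = len(timestamps)
    if n < 2 or n != len(used_fraction):
        raise ValueError("need at least two paired samples")

    mean_t = sum(timestamps) / n
    mean_u = sum(used_fraction) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")

    # Slope is the growth rate in "fraction of disk per second".
    cov = sum(
        (t - mean_t) * (u - mean_u)
        for t, u in zip(timestamps, used_fraction)
    )
    slope = cov / var_t
    if slope <= 0:
        return None  # not growing, so no projected fill time

    intercept = mean_u - slope * mean_t
    t_full = (1.0 - intercept) / slope  # time at which the line hits 1.0
    return max(0.0, t_full - timestamps[-1])


if __name__ == "__main__":
    # Illustrative samples: one reading per minute, usage as a fraction.
    ts = [0.0, 60.0, 120.0, 180.0]
    used = [0.70, 0.71, 0.72, 0.73]
    eta = seconds_until_full(ts, used)
    if eta is not None and eta < 4 * 3600:
        print(f"warning: disk projected full in {eta / 60:.0f} minutes")
```

The same idea should carry over to slowly rising latency: alert when the projected value crosses a limit within some window, instead of waiting for the current value to cross it.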
I'm trying to understand what actually works in real-world environments.