When we started, our observability stack was composed of Riemann, Splunk and thousands of unloved CloudWatch alerts. We found Riemann very difficult to work with given our setup and our multiple shards, so we evaluated the options we had and finally chose Prometheus.
We spend a lot of time pairing with squads so they can learn Prometheus and then create instrumentation and alerts on their own.
We're in the middle of the Prometheus rollout. Roughly 50% of the services and squads are using metrics and alerts, and this is on track to reach 90% by the end of next quarter.
Having a metrics and alerting platform ready to use is making engineers much more confident in their services. Still, some incidents are reported by customer support before our engineering tools catch them. That could mean either that we're missing something, or that we just have to double down on the rollout to make sure everything is covered and alerting correctly.
Could you share your thoughts and experience around observability?