My team is hosting an API that sends OpenTelemetry data to SigNoz & manages on-call via PagerDuty. We've configured SigNoz to page PagerDuty when there's a series of 500 errors.
However, our server went down last night and NO OpenTelemetry data was sent to SigNoz. We weren't notified that the server went down because there were no 500 responses to report. What's the easiest way to have a cron-like query hit our API and integrate with our existing stack? Is this feasible with our existing vendors? Should I have a serverless function running on a timer that uses PagerDuty's API? Should I be migrating to another monitoring service?
Any advice would be appreciated!
#!/bin/bash
# Add this script to cron to run at whatever interval you desire.

# URL to be checked
URL="https://example.com/test.php"

# Email for alerts
EMAIL="root@example.com"

# Perform the HTTP request and extract the status code, with a 10-second timeout.
# A timeout or connection failure makes curl report "000", which also fails
# the check below.
STATUS=$(curl -o /dev/null -s -w "%{http_code}" --max-time 10 "$URL")

# Check if the status code is not 200
if [ "$STATUS" -ne 200 ]; then
    # Send email alert
    echo "The URL $URL did not return a 200 status code. Status was $STATUS." | mail -s "URL Check Alert" "$EMAIL"
    # Instead of email, you could send a Slack/Teams/PagerDuty/Pushover/etc. alert, with something like:
    # curl -X POST https://events.pagerduty.com/...
fi
Edit: updated with suggested changes.

    # Send PagerDuty alert
    curl -X POST https://events.pagerduty.com/...

In fact, this is what I'd recommend (though I use Pushover), because then you don't have to be concerned with email setup, getting caught in spam filters, firewalls, etc. You could also send a Slack/Teams alert with a similar POST.
For some reason, I thought I had read that OP wanted to send an email.
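For reference, here's a sketch of what that PagerDuty POST can look like using the Events API v2 (`https://events.pagerduty.com/v2/enqueue`). The routing key and alert details are placeholders; you'd use the integration key from an "Events API v2" integration on your PagerDuty service:

```shell
#!/bin/bash
# Sketch: trigger a PagerDuty incident via the Events API v2.
# ROUTING_KEY is a placeholder -- substitute your own integration key.
ROUTING_KEY="${ROUTING_KEY:-YOUR_INTEGRATION_KEY}"

# Build the JSON payload separately so it can be inspected or logged.
PAYLOAD=$(cat <<EOF
{
  "routing_key": "${ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "URL check failed",
    "source": "uptime-check-script",
    "severity": "critical"
  }
}
EOF
)
echo "$PAYLOAD"

# Only send when a real key has been supplied.
if [ "$ROUTING_KEY" != "YOUR_INTEGRATION_KEY" ]; then
  curl -s -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
fi
```

Resolving the incident later is the same call with `"event_action": "resolve"` plus the `dedup_key` returned by the trigger.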
They usually all have PagerDuty integration as well.
Some examples:
Datadog: https://docs.datadoghq.com/synthetics/
Grafana Cloud: https://grafana.com/grafana/plugins/grafana-synthetic-monito...
1. Yes, a synthetic check is very useful here just to see if a user-facing "thing" is still working.
2. A heartbeat check / dead man's switch can also work here, but it will only be reliable when the monitored event has a predefined cadence, e.g. "every 5 minutes this should happen".
3. The lack/absence of metrics flowing into a system is also a sign. This would typically be solved on the SigNoz side, where they would alert on not seeing some specific event happen for x time. This can be tricky if the event is directly tied to a user interaction.
Super big disclaimer: founder at a monitoring company that solves 1 and 2, not 3.
Free and easy. Not affiliated, just a happy user.
I combine it with a service like UptimeRobot to get messages if the heartbeat stops.
For example: https://blog.ediri.io/how-to-set-up-a-dead-mans-switch-in-pr...
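The dead man's switch pattern in that post boils down to a one-line check-in at the end of the job. As a crontab sketch (the ping URL is a placeholder for whatever check-in URL your heartbeat-monitoring service issues):

```
# m h dom mon dow  command
0 * * * *  /usr/local/bin/backup.sh && curl -fsS --retry 3 https://example.com/ping/my-job-id > /dev/null
```

The `&&` means the check-in only fires when the job itself succeeded, so a failing job also trips the alarm.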
Thanks for your recommendations, everyone! We decided to go the route of measuring successful hits on the endpoint associated with our docs (inside our monitoring service). That's the default health check associated with our load balancer, so it gets hit periodically (a built-in cron job, effectively). We just added a SigNoz alert that triggers if the sum of those successful calls over the past X seconds falls below a threshold.
Be aware that you’re still leaving quite a bit of surface area unmonitored: any sort of issue between your clients and the load balancer (DNS configuration, firewall issues, SSL certificates, a network outage, etc.) could break things without firing an alert.
If you’re really trying to assert that some HTTP resource is publicly reachable, it’s still a good idea to have some external testing service periodically hitting your endpoint, running some assertions on the response, and alerting you if it’s failing for more than X minutes. (We do this; see my other reply.)
Hope that helps!
No 2XX-499 results means the server is borked.
If you wanted to be really fancy you could send heartbeat requests when no live traffic had hit that node for n milliseconds.
I still think the latter is the right answer. Why inflate traffic when you are horizontally scaling? That’s a lot of health checks per second.
You can do not only classic heartbeat checks but also high-level API checks (single-request and multi-request) and browser checks to make sure your service is behaving as expected.
We have a decent free plan that would probably work for you.
There's probably functionality built in to your other monitoring tools, or you could write a little serverless function to do it, but I really like to have it isolated. I wanted a tool that's not part of the rest of our monitoring stack, not part of our own codebase, and not something we manage.
I wish they had a better way to hit your phone. Currently it supports Slack and SMS but I need a way to make my phone go crazy when things go wrong (you can set Updown as an emergency contact but this has many downsides as well).
Something that can happen is that your alerting system stops working. I wrote alertmanager-status to bridge Alertmanager to these website uptime checkers: https://github.com/jrockway/alertmanager-status Basically, Alertmanager gives it a fake alert every so often. If that arrives, then alertmanager-status reports healthy to your website upness checker. If it doesn't arrive in a given time window, then that reports unhealthy, and you'll get an alert that alerts don't work. I use this for my personal stuff and at work and ... Alertmanager has never broken, so it was a waste of time to write ;)
Not sure about SigNoz and PagerDuty, but there are plenty of freemium services like UptimeRobot that work fine for the basics.
And then something like AWS CloudWatch has a lot more advanced options about how to treat missing data.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
With a lot of knobs to tune around what intervals you need and what counts as 'down', which you'll want to think about pretty carefully given what you need.
If you're on AWS, there's already heartbeat monitoring and that can integrate with CloudWatch to notify PagerDuty.
For example, AWS recommends using CloudWatch Synthetics to create a canary: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
Synthetics are there to tell you when something outside of your provider is breaking your website, like routing errors, geo-related issues, CDN issues (assuming you're not using your provider's CDN).
The heartbeat process itself basically just sent pings every second on a UDP port and waited for a reply. If we didn’t get a reply in one second, it’d assume the connection was bad until a ping came again.
So I set the script to also write out a metric that was just the time stamp the metrics were last updated. Then it was simple to set up an alert in Prometheus - I can't access the config now so you'll have to look it up yourself, but it was basically "alert if metric less than now minus a time gap"
With a metric, you can use a monotonic counter to serve as a heartbeat. A timestamp would work. In your monitoring system, when the heartbeat value has not increased in X minutes, you alert.
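A sketch of that timestamp heartbeat, assuming Prometheus with node_exporter's textfile collector (the directory and metric name below are placeholders; match the directory to your `--collector.textfile.directory` setting):

```shell
#!/bin/bash
# Write a "last success" timestamp metric for the textfile collector.
# TEXTFILE_DIR is an assumption -- point it at node_exporter's
# --collector.textfile.directory in a real deployment.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp/textfile}"
mkdir -p "$TEXTFILE_DIR"

# Run this at the end of the monitored job: record "now" as the heartbeat.
printf 'myjob_last_success_timestamp_seconds %s\n' "$(date +%s)" \
  > "$TEXTFILE_DIR/myjob.prom"

# The matching Prometheus alert expression is then roughly:
#   time() - myjob_last_success_timestamp_seconds > 900
```

Because the value only ever moves forward, "has not increased in X minutes" and "is older than X minutes" are the same condition here.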
I run multiple e-commerce websites where up time is critical and the types of errors could be anything. I use a service called Wormly that hits specific end points from multiple locations around the world. I'm not affiliated, just a happy customer.
Disclaimer: I'm the founder
I just configure it to access some endpoint on the API server. It checks it every minute and if it fails (non 200) it pings me.
You can also set it up in more complex ways but this is easy.
Write a cron job that greps the logs and pulls on your api with curl?
With another 20 minutes of work you can remember the log size on each run and grep only what's new on the next run.
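A sketch of that offset trick (the log path, state path, and grep pattern are all placeholders): remember the byte count from the previous run and only scan what was appended since.

```shell
#!/bin/bash
# Incremental log scan: keep a byte offset between runs and only grep
# the new portion of the log. LOG/STATE paths are placeholders.
LOG="${LOG:-/tmp/app.log}"
STATE="${STATE:-/tmp/app.log.offset}"
touch "$LOG"

LAST=$(cat "$STATE" 2>/dev/null || echo 0)
SIZE=$(stat -c %s "$LOG")

# If the log was rotated or truncated, start over from the beginning.
[ "$SIZE" -lt "$LAST" ] && LAST=0

# Count 5xx responses in just the new bytes (pattern is illustrative).
NEW_ERRORS=$(tail -c +"$((LAST + 1))" "$LOG" | grep -c ' 5[0-9][0-9] ' || true)
echo "new 5xx errors: $NEW_ERRORS"

# Remember where we got to for the next run.
echo "$SIZE" > "$STATE"
```

The rotation check matters: after logrotate, the file shrinks below the saved offset, and without the reset you'd silently scan nothing.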
Outbound Probes to hit your exposed HTTP services, or Inbound Liveness for your own cron jobs etc to check in.
edit: on second read, it sounds like regular uptime monitoring for your API would do the trick
There isn’t a good way to solve this using a PUSH model that isn’t somewhat of a hack or doesn’t use another external tool.
While my legal encumbrances prohibit helping you with actual code, I would recommend looking at watchdog processes.
For example, even a simple systemd periodic trigger that runs a script every minute to do general housekeeping can work. A small script has the advantages of minimal library dependencies, a fast/finite run state, and flexible behavior (checking CPU/network loads for nuisance traffic, and playing possum when a set threshold is exceeded).
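A minimal sketch of such a watchdog script, meant to be run every minute from a systemd timer or cron; the threshold and the reactions are placeholders, and it only reads the Linux load average:

```shell
#!/bin/bash
# Tiny watchdog: compare the 1-minute load average to a threshold.
# MAX_LOAD and the reactions below are placeholders.
MAX_LOAD="${MAX_LOAD:-8}"
LOAD=$(cut -d ' ' -f 1 /proc/loadavg)

# awk handles the floating-point comparison; exit status 0 means "over".
if awk -v l="$LOAD" -v t="$MAX_LOAD" 'BEGIN { exit !(l > t) }'; then
  echo "load $LOAD exceeds $MAX_LOAD: playing possum"
  # e.g. tighten firewall rules, pause non-essential services, page someone.
else
  echo "ok: load $LOAD"
fi
```

Keeping the check in the shell plus awk means the watchdog still runs when the rest of the stack (interpreters, libraries, the monitored service itself) is wedged.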
Distributed systems are hard, but polling generally does not scale well (i.e. the cost of a self-check is locally fixed, but balloons on a cluster if going beyond a few fixed peer-checks).
And yeah, some clowns who think they are James Bond have been DDoSing small boxes on Sun (they seem to be blindly hitting DNS and NTP ports hard). We had to reset the tripwire on a 6-year-old hobby host too.
Tip: rate-limiting firewall rules that give whitelisted peers/admins bandwidth guarantees are wise. Otherwise any cluster/heartbeats can be choked into a degraded state.
Don't take it personally, and enjoy a muffin with your tea. =3
You might want to use a different service for monitoring your stack, just to make sure you have some redundancy there. Seems like you got an answer for how to do this with SigNoz, but if SigNoz itself is down then you won't know.
That's more or less what I was gonna say...