My team is hosting an API that sends OpenTelemetry data to SigNoz & manages on-call via PagerDuty. We've configured SigNoz to page PagerDuty when there's a series of 500 errors.
However, our server went down last night and NO OpenTelemetry data was sent to SigNoz. We weren't notified that the server went down because there were no 500 responses to report. What's the easiest way to have a cron-like query hit our API and integrate with our existing stack? Is this feasible with our existing vendors? Should I have a serverless function running on a timer that uses PagerDuty's API? Should I be migrating to another monitoring service?
Any advice would be appreciated!
#!/bin/bash
# Add this script to cron to run at whatever interval you desire.

# URL to be checked
URL="https://example.com/test.php"

# Email for alerts
EMAIL="root@example.com"

# Perform the HTTP request and extract the status code, with a 10-second timeout.
# A timeout or connection failure makes curl report "000", which also fails
# the check below.
STATUS=$(curl -o /dev/null -s -w "%{http_code}" --max-time 10 "$URL")

# Check if the status code is not 200
if [ "$STATUS" -ne 200 ]; then
    # Send email alert
    echo "The URL $URL did not return a 200 status code. Status was $STATUS." | mail -s "URL Check Alert" "$EMAIL"
    # Instead of email, you could send a Slack/Teams/PagerDuty/Pushover/etc. alert, with something like:
    # curl -X POST https://events.pagerduty.com/...
fi
Edit: updated with suggested changes.

    # Send PagerDuty alert
    curl -X POST https://events.pagerduty.com/...

In fact, this is what I'd recommend (though I use Pushover), because then you don't have to be concerned with email setup, getting caught in spam filters, firewalls, etc. You could also send a Slack/Teams alert with a similar POST.
For some reason, I thought I had read that OP wanted to send an email.
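For reference, here's a sketch of what that PagerDuty POST can look like using the Events API v2 (`https://events.pagerduty.com/v2/enqueue`). The routing key and alert details are placeholders; you'd use the integration key from an "Events API v2" integration on your PagerDuty service:

```shell
#!/bin/bash
# Sketch: trigger a PagerDuty incident via the Events API v2.
# ROUTING_KEY is a placeholder -- substitute your own integration key.
ROUTING_KEY="${ROUTING_KEY:-YOUR_INTEGRATION_KEY}"

# Build the JSON payload separately so it can be inspected or logged.
PAYLOAD=$(cat <<EOF
{
  "routing_key": "${ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "URL check failed",
    "source": "uptime-check-script",
    "severity": "critical"
  }
}
EOF
)
echo "$PAYLOAD"

# Only send when a real key has been supplied.
if [ "$ROUTING_KEY" != "YOUR_INTEGRATION_KEY" ]; then
  curl -s -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
fi
```

Resolving the incident later is the same call with `"event_action": "resolve"` plus the `dedup_key` returned by the trigger.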
They usually all have PagerDuty integration as well.
Some examples:
Datadog: https://docs.datadoghq.com/synthetics/
Grafana Cloud: https://grafana.com/grafana/plugins/grafana-synthetic-monito...
1. Yes, a synthetic check is very useful here just to see if a user-facing "thing" is still working.
2. A heartbeat check / dead man's switch can also work here, but it will only be reliable when the monitored event has a predefined cadence, e.g. "every 5 minutes this should happen".
3. The lack/absence of metrics flowing into a system is also a sign. This would typically be solved on the SigNoz side, where they would alert on not seeing some specific event happen for x time. This can be tricky if the event is directly tied to a user interaction.
Super big disclaimer: founder at a monitoring company that solves 1 and 2, not 3.
Free and easy. Not affiliated, just a happy user.
I combine it with a service like UptimeRobot to get messages if the heartbeat stops.
For example: https://blog.ediri.io/how-to-set-up-a-dead-mans-switch-in-pr...
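The dead man's switch pattern in that post boils down to a one-line check-in at the end of the job. As a crontab sketch (the ping URL is a placeholder for whatever check-in URL your heartbeat-monitoring service issues):

```
# m h dom mon dow  command
0 * * * *  /usr/local/bin/backup.sh && curl -fsS --retry 3 https://example.com/ping/my-job-id > /dev/null
```

The `&&` means the check-in only fires when the job itself succeeded, so a failing job also trips the alarm.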
Thanks for your recommendations, everyone! We decided to go the route of measuring successful hits on the endpoint associated with our docs (inside our monitoring service). That's the default health check associated with our load balancer, so it gets hit periodically (a built-in cron job, effectively). We just added a SigNoz alert that triggers if the sum of those successful calls over the past X seconds falls below a threshold.
Be aware that you’re still leaving quite a bit of surface area unmonitored: any sort of issue between your clients and the load balancer (DNS configuration, firewall issues, SSL certificates, a network outage, etc.) could break things without firing an alert.
If you’re really trying to assert that some HTTP resource is publicly reachable, it’s still a good idea to have some external testing service periodically hitting your endpoint, running some assertions on the response, and alerting you if it’s failing for more than X minutes. (We do this; see my other reply.)
Hope that helps!
No 2XX-499 results means the server is borked.
If you wanted to be really fancy you could send heartbeat requests when no live traffic had hit that node for n milliseconds.
I still think the latter is the right answer. Why inflate traffic when you are horizontally scaling? That’s a lot of health checks per second.
You can do not only classic heartbeat checks but also high-level API checks (single-request and multi-request) and browser checks to make sure your service is behaving as expected.
We have a decent free plan that would probably work for you.
There's probably functionality built in to your other monitoring tools, or you could write a little serverless function to do it, but I really like to have it isolated. I wanted a tool that's not part of the rest of our monitoring stack, not part of our own codebase, and not something we manage.
I wish they had a better way to hit your phone. Currently it supports Slack and SMS but I need a way to make my phone go crazy when things go wrong (you can set Updown as an emergency contact but this has many downsides as well).
Something that can happen is that your alerting system stops working. I wrote alertmanager-status to bridge Alertmanager to these website uptime checkers: https://github.com/jrockway/alertmanager-status Basically, Alertmanager gives it a fake alert every so often. If that arrives, then alertmanager-status reports healthy to your website upness checker. If it doesn't arrive in a given time window, then that reports unhealthy, and you'll get an alert that alerts don't work. I use this for my personal stuff and at work and ... Alertmanager has never broken, so it was a waste of time to write ;)
Not sure about SigNoz and PagerDuty, but there are plenty of freemium services like UptimeRobot that work fine for the basics.
And then something like AWS CloudWatch has a lot more advanced options about how to treat missing data.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
With a lot of knobs to tune around what intervals you need and what counts as 'down', which you'll want to think about pretty carefully given what you need.
If you're on AWS, there's already heartbeat monitoring and that can integrate with CloudWatch to notify PagerDuty.
For example, AWS recommends using CloudWatch Synthetics to create a canary: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
Synthetics are there to tell you when something outside of your provider is breaking your website, like routing errors, geo-related issues, CDN issues (assuming you're not using your provider's CDN).
The heartbeat process itself basically just sent pings every second on a UDP port and waited for a reply. If we didn’t get a reply in one second, it’d assume the connection was bad until a ping came again.
So I set the script to also write out a metric that was just the time stamp the metrics were last updated. Then it was simple to set up an alert in Prometheus - I can't access the config now so you'll have to look it up yourself, but it was basically "alert if metric less than now minus a time gap"
With a metric, you can use a monotonic counter to serve as a heartbeat. A timestamp would work. In your monitoring system, when the heartbeat value has not increased in X minutes, you alert.
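A sketch of that timestamp heartbeat, assuming Prometheus with node_exporter's textfile collector (the directory and metric name below are placeholders; match the directory to your `--collector.textfile.directory` setting):

```shell
#!/bin/bash
# Write a "last success" timestamp metric for the textfile collector.
# TEXTFILE_DIR is an assumption -- point it at node_exporter's
# --collector.textfile.directory in a real deployment.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp/textfile}"
mkdir -p "$TEXTFILE_DIR"

# Run this at the end of the monitored job: record "now" as the heartbeat.
printf 'myjob_last_success_timestamp_seconds %s\n' "$(date +%s)" \
  > "$TEXTFILE_DIR/myjob.prom"

# The matching Prometheus alert expression is then roughly:
#   time() - myjob_last_success_timestamp_seconds > 900
```

Because the value only ever moves forward, "has not increased in X minutes" and "is older than X minutes" are the same condition here.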
I run multiple e-commerce websites where up time is critical and the types of errors could be anything. I use a service called Wormly that hits specific end points from multiple locations around the world. I'm not affiliated, just a happy customer.
Disclaimer: I'm the founder
I just configure it to access some endpoint on the API server. It checks it every minute and if it fails (non 200) it pings me.
You can also set it up in more complex ways but this is easy.
Write a cron job that greps the logs and pulls on your api with curl?
With another 20 minutes of work you can remember the log size on each run and grep only what's new on the next run.
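A sketch of that offset trick (the log path, state path, and grep pattern are all placeholders): remember the byte count from the previous run and only scan what was appended since.

```shell
#!/bin/bash
# Incremental log scan: keep a byte offset between runs and only grep
# the new portion of the log. LOG/STATE paths are placeholders.
LOG="${LOG:-/tmp/app.log}"
STATE="${STATE:-/tmp/app.log.offset}"
touch "$LOG"

LAST=$(cat "$STATE" 2>/dev/null || echo 0)
SIZE=$(stat -c %s "$LOG")

# If the log was rotated or truncated, start over from the beginning.
[ "$SIZE" -lt "$LAST" ] && LAST=0

# Count 5xx responses in just the new bytes (pattern is illustrative).
NEW_ERRORS=$(tail -c +"$((LAST + 1))" "$LOG" | grep -c ' 5[0-9][0-9] ' || true)
echo "new 5xx errors: $NEW_ERRORS"

# Remember where we got to for the next run.
echo "$SIZE" > "$STATE"
```

The rotation check matters: after logrotate, the file shrinks below the saved offset, and without the reset you'd silently scan nothing.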
Outbound Probes to hit your exposed HTTP services, or Inbound Liveness for your own cron jobs etc to check in.
edit: on second read, it sounds like regular uptime monitoring for your API would do the trick
There isn’t a good way to solve this using a PUSH model that isn’t somewhat of a hack or doesn’t use another external tool.
While my legal encumbrances prohibit helping you with actual code, I would recommend looking at watchdog processes.
For example, even a simple systemd periodic trigger that runs a script every minute to do general housekeeping can work. A small script has the advantages of minimal library dependencies, a fast/finite run state, and flexible behavior (checking CPU/network loads for nuisance traffic, and playing possum when a set threshold is exceeded).
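A minimal sketch of such a watchdog script, meant to be run every minute from a systemd timer or cron; the threshold and the reactions are placeholders, and it only reads the Linux load average:

```shell
#!/bin/bash
# Tiny watchdog: compare the 1-minute load average to a threshold.
# MAX_LOAD and the reactions below are placeholders.
MAX_LOAD="${MAX_LOAD:-8}"
LOAD=$(cut -d ' ' -f 1 /proc/loadavg)

# awk handles the floating-point comparison; exit status 0 means "over".
if awk -v l="$LOAD" -v t="$MAX_LOAD" 'BEGIN { exit !(l > t) }'; then
  echo "load $LOAD exceeds $MAX_LOAD: playing possum"
  # e.g. tighten firewall rules, pause non-essential services, page someone.
else
  echo "ok: load $LOAD"
fi
```

Keeping the check in the shell plus awk means the watchdog still runs when the rest of the stack (interpreters, libraries, the monitored service itself) is wedged.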
Distributed systems are hard, but polling generally does not scale well (i.e. the cost of a self-check is locally fixed, but balloons on a cluster if going beyond a few fixed peer-checks).
And yeah, some clowns who think they are James Bond have been DDoSing small boxes on Sun (they seem to be blindly hitting DNS and NTP ports hard). We had to reset the tripwire on a 6-year-old hobby host too.
Tip: rate-limiting firewall rules that give whitelisted peers/admins bandwidth guarantees are wise. Otherwise any cluster/heartbeats can be choked into a degraded state.
Don't take it personally, and enjoy a muffin with your tea. =3
You might want to use a different service for monitoring your stack, just to make sure you have some redundancy there. Seems like you got an answer for how to do this with SigNoz, but if SigNoz itself is down then you won't know.
That's more or less what I was gonna say...