E.g. you have a script that does backups. You log the script's output, but one day something fails and the script is no longer executed.
Some form of dead man's handle is needed; the only way I can think of is to set up a monitoring service that checks your log store every X hours and raises an alert if the expected entries are missing.
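A rough sketch of that kind of check, in Python. Everything here is an assumption for illustration: `fetch_last_entry_time` is a hypothetical callable standing in for whatever query your log store supports, and the 26-hour threshold is just an example for a daily job with some slack.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_seen, max_age, now=None):
    """Return True if the last log entry is missing or older than max_age."""
    now = now or datetime.now(timezone.utc)
    if last_seen is None:
        # No entry at all is the clearest sign the script never ran.
        return True
    return now - last_seen > max_age


def check_backup_heartbeat(fetch_last_entry_time, max_age=timedelta(hours=26)):
    """Report on the backup script's most recent log entry.

    fetch_last_entry_time is hypothetical: it should return a timezone-aware
    datetime for the newest backup log entry, or None if there is none.
    """
    last_seen = fetch_last_entry_time()
    if is_stale(last_seen, max_age):
        return "ALERT: no backup log entry since %s" % last_seen
    return "OK"
```

The point is that the check alerts on the absence of an entry rather than on an error, so it still fires when the script silently stops being executed.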
Any alternatives?
I've toyed with the idea of writing a daily "sanity checker" in crontab that verifies various concepts of system health.
Examples: Did the latest batch of data transfer to S3? Did we delete old customer accounts today? Did we get any signups (because if not, something may be broken, but not triggering an exception report etc)? Did we send out daily report emails?
But I could see this easily becoming a pointless exercise, and I doubt I'd have the time to keep the sanity checker updated with the latest requirements. In fact, the sanity checker would probably become insane pretty quickly.
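For what it's worth, the sanity checker itself could stay quite small if each check is just a named function that returns true or false. A minimal sketch in Python; the check names and the lambdas are placeholders, not real queries:

```python
def run_checks(checks):
    """Run each named check and return the names that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Placeholder checks; real ones would query S3, the database, the mail log, etc.
checks = {
    "s3-transfer": lambda: True,
    "signups-today": lambda: False,
}

if __name__ == "__main__":
    print(run_checks(checks))
```

This does not solve the maintenance problem, but it does mean adding or retiring a check is a one-line change.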
Perhaps the platform itself should do this for you in some way. Idea: while coding, you indicate that a procedure is expected to run periodically, e.g.:
Monitor.registerPeriodicTask('email-reports', 'daily')
and then the system would log each time the task runs, while a generic job would run periodically and scan for things that should have occurred but haven't in a while.
Splunk DOES charge by the GB, but it's not very expensive in the long run.
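To make the `registerPeriodicTask` idea above a bit more concrete, here is a minimal in-memory sketch in Python. The `Monitor` class, its method names, the `INTERVALS` table, and the two-hour grace period are all assumptions made up for illustration; a real version would persist its state somewhere that survives the process dying.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from frequency labels to expected intervals.
INTERVALS = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


class Monitor:
    def __init__(self):
        self._tasks = {}  # task name -> [expected interval, last run time]

    def register_periodic_task(self, name, frequency, now=None):
        """Declare that a task should run at the given frequency."""
        now = now or datetime.now(timezone.utc)
        # The clock starts at registration, so a task that never runs
        # is still reported once its first interval has passed.
        self._tasks[name] = [INTERVALS[frequency], now]

    def record_run(self, name, now=None):
        """Call this each time the task actually runs."""
        self._tasks[name][1] = now or datetime.now(timezone.utc)

    def overdue(self, now=None, grace=timedelta(hours=2)):
        """Return the names of tasks that have not run within interval + grace."""
        now = now or datetime.now(timezone.utc)
        return [
            name
            for name, (interval, last_run) in self._tasks.items()
            if now - last_run > interval + grace
        ]
```

The task would call `record_run('email-reports')` when it finishes, and the generic scanning job would call `overdue()` on its own schedule and alert on anything it returns.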