Monitoring helps, but unless your Ops staff knows what to do with a misbehaving database (RDBMS or other), it falls on the DBA or equivalent.
For example, perhaps the simplest solution would be to cron a script that checks 'df' output and sends an email as soon as you hit some reasonable threshold.
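As a rough sketch of that cron approach, something like the following might do. It reads usage with Python's `shutil.disk_usage` rather than parsing `df` output, so the percentage can differ slightly from df's Use% column. The 80% threshold, the addresses, and the assumption of a mail relay listening on localhost are all placeholders, not anything from a real setup.

```python
import shutil
import smtplib
from email.message import EmailMessage

THRESHOLD_PERCENT = 80.0            # assumed threshold; tune to taste
ALERT_FROM = "cron@example.com"     # hypothetical sender
ALERT_TO = "ops@example.com"        # hypothetical recipient


def percent_used(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def build_alert(path, percent, threshold=THRESHOLD_PERCENT):
    """Return an EmailMessage if `percent` is at or over `threshold`, else None."""
    if percent < threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Disk usage on {path} at {percent:.1f}%"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(
        f"{path} is {percent:.1f}% full (threshold {threshold:.0f}%)."
    )
    return msg


def main(path="/"):
    """Check one filesystem and mail an alert if it is over the threshold."""
    msg = build_alert(path, percent_used(path))
    if msg is not None:
        # Assumes an SMTP relay on localhost; swap in your own mail host.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
```

To use it from cron you would add a call to `main()` at the bottom of the file (left out here so the block has no side effects) and schedule it with an entry along the lines of `*/15 * * * * /usr/bin/python3 /path/to/check_disk.py`, where the path is whatever you choose.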
More complex, but significantly more powerful, is running something along the lines of Nagios to monitor not only disk usage but a plethora of other system-level checks.
Once you've gone down that road, it's not a big leap to start monitoring the application itself.
Why stop there? If you've got your metrics system (like Graphite) up and running, you can pull in raw metrics and trend your disk usage over time. Write a script that pulls in the raw data (add rawData=true to your parameters in Graphite) and then set thresholds on that. Have Graphite take the standard deviation of your disk metric and now you're alerting not only on an absolute threshold, but monitoring for sudden spikes in activity.
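A minimal sketch of that script, under a few assumptions: the Graphite render endpoint returns one line per series in the form `target,start,end,step|v1,v2,...` when `rawData=true` is set, missing points show up as `None`, and the standard deviation is computed here on the client side instead of inside Graphite. The base URL, the metric name, and the 90% and 3-sigma limits are made up for illustration.

```python
import statistics
import urllib.parse
import urllib.request


def fetch_raw(base_url, target, since="-24h"):
    """Fetch raw series text from a Graphite render endpoint."""
    query = urllib.parse.urlencode(
        {"target": target, "from": since, "rawData": "true"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}") as resp:
        return resp.read().decode("utf-8")


def parse_raw_line(line):
    """Split one raw line into (target, list of float values)."""
    header, _, values = line.strip().partition("|")
    # The target may itself contain commas, so split from the right.
    target, _start, _end, _step = header.rsplit(",", 3)
    points = [float(v) for v in values.split(",") if v not in ("None", "")]
    return target, points


def check_series(points, absolute_limit=90.0, sigmas=3.0):
    """Return the reasons the latest point should alert (possibly empty)."""
    if not points:
        return []
    latest, history = points[-1], points[:-1]
    reasons = []
    if latest >= absolute_limit:
        reasons.append("absolute")
    if len(history) >= 2:
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        # A flat history has zero deviation, so skip the spike check there.
        if stdev > 0 and abs(latest - mean) > sigmas * stdev:
            reasons.append("spike")
    return reasons
```

Usage would look something like `check_series(parse_raw_line(line)[1])` for each line returned by `fetch_raw("http://graphite.example.com", "servers.db1.disk.percent_used")`, with both arguments being placeholders for your own host and metric.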
You may also very well be able to get "more complex" without your own infrastructure ... with the tradeoff being money and relying on 3rd party SaaS. There are pros and cons involved here.
Circle back, for a second, though. Putting in a complex solution that gives you the kitchen sink requires time and money. Nagios and Graphite add a layer of complexity that may be totally overblown for your needs at the moment. SaaS might not fit the bill. Right now may NOT be the time to go all crazy. So start simple. Get that cron job in place today, gain a little peace of mind, and then figure out what your next steps should be.
1. Set up an alert at a conservative usage threshold to make sure nothing like this can happen
2. See alert and know you have plenty of time to fix the issue
3. Get distracted
4. Disk space disaster
We use AppFirst for our monitoring alerts. One thing they don't support is sending recurring alerts while something is over a threshold. They only send when thresholds are crossed.
Right now we're experimenting with having PagerDuty read the AppFirst alerts and treat each one as an open issue.
I'd love to know what other people are using.
If you're not monitoring basics like disk utilization and RAM, you're just asking for unnecessary downtime.
You'll quickly be annoyed into fixing it.
This has worked wonders for me on countless occasions.
Please, more contrast between text and background. It's like reading through a haze.