1. You know there's a bug
2. You can't reproduce it
Several next steps come to mind:
1. Hire an outside expert who's dealt with this sort of thing before. They may be able to theorize what's going on and come up with a solution.
2. Install measures that don't prevent the problem but prevent the damage. For example, an emergency failsafe that shuts down the system or relieves the pressure when the incident occurs. This is why electrical systems have fuse boxes! Error management is sometimes the only option, because 100% error prevention is impossible.
3. Install monitoring that tracks a lot more detail than you are currently getting. When the next error occurs, you will know a lot more and may have the information you need.
Edit: What's the name of the theory in networking that 100% error prevention is not possible, so error handling is the only option? There was a great article on HN about it a few years back.
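To make option 2 concrete, here is a rough sketch of a software "fuse": it does nothing to prevent the fault, it just trips once a reading crosses a limit and stays tripped until someone resets it. The class name `Failsafe`, the `limit` value, and the `on_trip` callback are all made up for illustration; a real system would call its own shutdown or relief routine there.

```python
class Failsafe:
    """Hypothetical software fuse: trips once, stays tripped until reset."""

    def __init__(self, limit, on_trip):
        self.limit = limit        # assumed threshold, e.g. max safe pressure
        self.on_trip = on_trip    # assumed callback: shut down, vent, alert...
        self.tripped = False

    def check(self, reading):
        # Fire the callback only on the first reading above the limit.
        if not self.tripped and reading > self.limit:
            self.tripped = True
            self.on_trip(reading)
        return self.tripped

    def reset(self):
        # Manual reset, like replacing a blown fuse.
        self.tripped = False


# Usage: record the reading that caused the trip instead of a real shutdown.
events = []
fuse = Failsafe(limit=100.0, on_trip=events.append)
for pressure in (40.0, 95.0, 130.0, 150.0):
    fuse.check(pressure)
```

The callback fires once, on the first bad reading, so the damage-control action isn't repeated while the system is already shutting down.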
#1 is almost never a good solution. The time it would take for them to become familiar enough with the codebase to not aggravate your existing engineers would exceed several iterations of #3. Also, I've rarely met an outside expert whose solutions didn't involve rewriting everything to meet their expectations of "correct implementation." This could be a sample selection problem on my part, however.
* - How do you know that you are monitoring the correct component? This path usually leads to multiple monitoring development tasks: you find that where you thought the problem originated was in fact a symptom, and you keep adding more monitoring as you get closer to the source. This is why I almost always add an insane level of logging to any application, and control the verbosity through runtime controls.
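A minimal sketch of the "log everything, control verbosity at runtime" idea, using Python's standard `logging` module. The logger name `myapp.worker` and the functions `set_verbosity` and `process` are hypothetical; how the level change gets triggered (admin endpoint, signal, config reload) is left out and would depend on the application.

```python
import logging

# Hypothetical logger name; the debug calls stay in the code permanently.
log = logging.getLogger("myapp.worker")
log.setLevel(logging.WARNING)  # quiet by default in production


def set_verbosity(level_name):
    """Change verbosity while the process is running.

    Logger.setLevel accepts a level name such as "DEBUG" and raises
    ValueError for an unknown name.
    """
    log.setLevel(level_name.upper())


def process(item):
    # Cheap when DEBUG is off: the message is only formatted if emitted.
    log.debug("processing item=%r", item)
    result = item * 2
    log.debug("finished item=%r result=%r", item, result)
    return result
```

When the bug shows up, something like `set_verbosity("debug")` turns the detail on without a redeploy, and `set_verbosity("warning")` turns it back off afterwards.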
I'm thinking too that there's a feedback process at work here, and some sort of damping mechanism would help with that.
Tortured turtles.
At this point I would do two things:
1. log all the things.
2. find me my top QA person, the one who can find bugs that nobody has yet reported. Put her on it.
OK, everybody knows that logging is good. And everyone knows that QA is good.
What I have found, though, is a number of companies who think that QA is best done by the developer who wrote the feature... and I think they are absolutely wrong in every sense, except possibly short-term economics. Having someone do QA who has none of their ego invested in the code is essential.
No bidets, though.