In such cases, I use "torture" tests: lengthy, randomized tests that try to abuse and overload the system with no data, incorrect data, random data, huge data, high latencies, duplicated messages, missed messages, random aborts, random spikes, etc. They allow me to discover situations not covered by the regular test cases. I also try to use underpowered hardware for such testing. Of course, I cannot imagine every possible torture scenario, but I have seen a lot of bug and security reports, so I still know a lot of scenarios, more than I am willing to write individual tests for.
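To make the idea concrete, here is a minimal sketch of such a torture test in Python. It is only an illustration under assumptions: `torture_stream`, `handle`, `run_torture`, and the `id:payload` message format are hypothetical names made up for this example, not taken from any real system. The generator drops, corrupts, and duplicates messages at random, and the driver checks one invariant (no id is accepted twice) across many seeds.

```python
import random


def torture_stream(messages, seed, drop=0.1, dup=0.1, corrupt=0.1):
    """Yield messages with random drops, duplicates, and garbage payloads."""
    rng = random.Random(seed)  # seeded, so a failing run can be replayed
    for msg in messages:
        roll = rng.random()
        if roll < drop:
            continue  # missed message
        if roll < drop + corrupt:
            # random data of random length, including the empty message
            yield bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
            continue
        yield msg
        if rng.random() < dup:
            yield msg  # duplicated message


def handle(msg, seen):
    """Hypothetical handler under test: accepts b'id:payload' once per id."""
    try:
        ident, _payload = msg.split(b":", 1)
        key = int(ident)
    except ValueError:
        return False  # malformed input is rejected, not fatal
    if key in seen:
        return False  # duplicate suppressed
    seen.add(key)
    return True


def run_torture(rounds=200, n=500):
    """Replay many randomized streams and check the invariant on each."""
    for seed in range(rounds):
        seen = set()
        clean = [b"%d:data" % i for i in range(n)]
        accepted = sum(handle(m, seen) for m in torture_stream(clean, seed))
        # Invariant: every accepted message introduced a new id.
        assert accepted == len(seen), f"invariant broken with seed {seed}"
    return rounds


if __name__ == "__main__":
    print("rounds survived:", run_torture())
```

Seeding the generator per round is the design choice that matters most here: when a round fails, the seed in the assertion message is enough to reproduce exactly the same abusive stream.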
That's a good approach as well! In addition, nowadays with VMs/containers you can even simulate nodes randomly going up and down, which is a bit of a challenge to do in a real testing cluster.