Changed back to just using big resolvers and all those issues disappeared.
If you run your own recursive DNS server (I keep forgetting to use the right term) on a local network, you can hit the root servers directly, which makes that the most reliable possible DNS resolver. Yes you might get more cache misses initially but I highly doubt you'd notice. (note: querying the root nameservers is bad netiquette; you should always cache queries to them for at least 5 minutes, and always use DNS resolvers to cache locally)
I'd argue that accounting for poorly managed ISP resolvers is a critical part of reasoning about reliability.
In terms of my everyday usage, for the past couple of decades, cache miss delays are largely lost in the noise of stupidly huge WWW pages, artificial service greylisting delays, CAPTCHA delays, and so forth.
Especially as the first step in any full cache miss, a back-end query to the root content DNS server, is also just a round-trip over the loopback interface. Indeed, as is also the second step sometimes now, since some TLDs also let one mirror their data. Thank you, Estonia. https://news.ycombinator.com/item?id=44318136
And the gains in other areas are significant. Remember that privacy and security are also things that people want.
Then there's the fact that things like Quad9's/Google's/CloudFlare's anycasting surprisingly often results in hitting multiple independent servers for successive lookups, not yielding the cache gains that a superficial understanding would lead one to expect.
Just for fun, I did Bender's test at https://news.ycombinator.com/item?id=44534938 a couple of days ago, in a loop. I received reset-to-maximum TTLs from multiple successive cache misses, on queries spaced merely 10 seconds apart, from all three of Quad9, Google Public DNS, and CloudFlare 1.1.1.1. With some maths, I could probably make a good estimate as to how many separate anycast caches on those services are answering me from scratch, and not actually providing the cache hits that one would naïvely think would happen.
I added 127.0.0.1 to Bender's list, of course. That had 1 cache miss at the beginning and then hit the cache every single time, just counting down the TTL by 10 seconds each iteration of the loop; although it did decide that 42 days was unreasonably long, and reduced it to a week. (-: