> I'm handling bugs very differently than network failures though, because network failures are usually temporary while bugs are usually (or even by definition) permanent.
Depends on the bug - there are transient bugs that are not networking related.
But let's assume it's a "hard error", i.e. a consistently failing bug. I would say that where the bug lives makes a huge difference.
If it's a critical feature, that bug should probably get propagated. If it's a non-critical feature, maybe you can recover.
By isolating your state across a network boundary, failure recovery is made much simpler (because you do not need to unwind to a 'safe point' - the safe point is your network boundary).
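To make that concrete, here is a rough sketch of what I mean, not a real client library. `RemoteCallError`, `call_with_policy`, and the two example services are names I made up for illustration. The point is only that the caller holds no shared state with the failing service, so "recovering" is just deciding whether to re-raise or fall back.

```python
class RemoteCallError(Exception):
    """Hypothetical error: a call across the service boundary failed."""


def call_with_policy(remote_call, *, critical, fallback=None):
    """Run remote_call; re-raise on failure if critical, else return fallback."""
    try:
        return remote_call()
    except RemoteCallError:
        if critical:
            raise  # critical feature: propagate the failure to the caller
        # Non-critical feature: degrade. Nothing to unwind locally, because
        # the failing service's state lives on the other side of the boundary.
        return fallback


def broken_recommendations():
    # Stand-in for a non-critical service with a consistently failing bug.
    raise RemoteCallError("recommendation service is failing")


def working_checkout():
    # Stand-in for a critical service that is healthy.
    return "order-confirmed"
```

With this, `call_with_policy(broken_recommendations, critical=False, fallback=[])` gives back the empty fallback, while the same call with `critical=True` lets the error through.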
But it often depends on how you do it. I personally prefer to write microservices that use queues for the vast majority of interactions. This makes fault isolation close to trivial (because the retry logic moves into the queue) and it scales very well.
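As a toy illustration of moving retries into the queue, here is an in-memory sketch. A real broker would handle redelivery and dead-lettering for you; the `drain` function, the `MAX_ATTEMPTS` budget, and the example handler are all my own assumptions, not any particular queue's API.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget, pick whatever suits your workload


def drain(queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process (message, attempts) pairs; requeue failures, then dead-letter."""
    dead_letters = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                # Consistently failing: park it instead of retrying forever.
                dead_letters.append(message)
            else:
                # Possibly transient: put it back at the end of the queue.
                queue.append((message, attempts))
    return dead_letters


seen = {}        # how many times each message was delivered
processed = []   # messages the handler completed


def example_handler(message):
    """Hypothetical consumer: 'bad' always fails, 'flaky' fails once."""
    seen[message] = seen.get(message, 0) + 1
    if message == "bad":
        raise ValueError("permanent bug")
    if message == "flaky" and seen[message] == 1:
        raise TimeoutError("transient failure")
    processed.append(message)


work = deque([("ok", 0), ("flaky", 0), ("bad", 0)])
dead = drain(work, example_handler)
```

The handler itself contains no retry code: the transient failure succeeds on redelivery, and the permanent one ends up in `dead` after the retry budget is spent, without blocking the other messages.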
If you build lots of microservices with synchronous communications I think you'll run into a lot more complexity.
Still, I maintain that faults were already something you had to handle, and that a network boundary encourages better fault handling by effectively forcing it upon you.