I believe what is observed here are symptoms, not the root cause.
My experience in this area tells me that once you go the "microservices" route, there is no longer a coherent view of the system: the holistic design & architecture ends up driven by the many integration issues rather than by improving the data domain and its inherent business challenges. So basically (over)engineering vs. creating features.
I can't see how an academic could arrive at this conclusion unless they took part first-hand in several organizations going this route, and contrasted that with first-hand experience of a more "monolith" approach, or at least one with less emphasis on "micro-servicing all the things".
I’m not sure I follow the argument though.
Just because you have demonstrated that a system is scalable, and that it is tolerant of errors, does not imply it is tolerant of errors at scale.
Take the example of Expedia’s error handling, which they claim could have been verified without chaos testing:
> Expedia tested a simple fallback pattern where, when one dependent service is unavailable and returns an error, another service is contacted instead afterwards. There is no need to run this experiment in production by terminating servers in production: a simple test that mocks the response of the dependent service and returns a failure is sufficient.
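The mock-based test the quote describes can be sketched in a few lines. This is a minimal illustration, not Expedia's actual code: `fetch_with_fallback`, the service objects, and the key are all hypothetical names, and the real pattern presumably sits behind an HTTP client rather than an in-process mock.

```python
from unittest.mock import Mock


def fetch_with_fallback(primary, fallback, key):
    """Try the primary service; if it returns an error, contact the fallback instead."""
    try:
        return primary.get(key)
    except ConnectionError:
        return fallback.get(key)


# Mock the dependent service's response as a failure -- no production
# servers need to be terminated to exercise the fallback path.
primary = Mock()
primary.get.side_effect = ConnectionError("primary unavailable")

fallback = Mock()
fallback.get.return_value = "result-from-fallback"

assert fetch_with_fallback(primary, fallback, "hotel-123") == "result-from-fallback"
fallback.get.assert_called_once_with("hotel-123")
```

This is exactly the kind of unit-level check the paper says is sufficient; the questions below are about what such a test does *not* cover.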
When the first service becomes unavailable, does the alternate service have a cold cache? Does that drive increased timeouts and retries? Is there a hidden codependency between the alternate service and whatever caused the outage of the first service?
Maybe that can all be verified by independent non-chaos scalability testing of that service.
But chaos testing is to those individual service load tests and mock-error tests what integration testing is to unit tests. Sure, in theory this service fails over to calling a different dependency. And in theory that dependency is scalable.
Running a chaos test confirms that those assumptions are correct - that scalability + error tolerance actually delivers resilience.
> Most of this conversation can be obviated by spending time minimizing the number of systems, dependencies, vendors and other 3rd party items required to satisfy the product objectives. Prefer more "batteries-included" ecosystems when feasible.
> Start with a monolithic binary, SQLite and a single production host. Change this only when measurements and business requirements actually force you to. Plan for the possibility that you might have to expand to more than one production host, but don't prioritize it as an inevitability. There is no such thing as an executable that is "too big" when the alternative is sharding your circumstances to the 7 winds.
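As a concrete illustration of the quoted "batteries-included" starting point: Python's standard library already ships the whole persistence layer. This sketch assumes a single-process app; the table and data are made up, and `:memory:` would be a file path on the one production host.

```python
import sqlite3

# Monolith-first: the stdlib sqlite3 module is the entire database tier,
# so there is no second host, vendor, or network dependency to manage.
conn = sqlite3.connect(":memory:")  # use a file path, e.g. "app.db", in production
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("INSERT INTO orders (item) VALUES (?)", ("widget",))
conn.commit()

items = [row[0] for row in conn.execute("SELECT item FROM orders")]
assert items == ["widget"]
```

Only when measurements show this single-host setup failing does the sharding conversation need to start.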