A reasonable argument, although not one I seem to run up against, perhaps because I'm rarely concerned about high performance when writing functions with that many layers of subcalls.
By contrast, when trying to optimize inner loops, I frequently encounter cases where the front-end cap of 4 micro-ops per cycle is the bottleneck, and eliminating any extraneous instruction is a speedup. And rather than worrying about a deep stack causing L1 data cache misses, I'm more concerned with missing the L1 instruction cache, or with the extra micro-ops overflowing the ~1000-entry decoded micro-op cache.
These concerns are clearly at opposite ends of the performance spectrum, and which should dominate probably depends on the problem at hand.
(I glanced at your comment history. Welcome to HN! You have good insights. Please stick around.)