IMHO the main thing that determines this trend is whether the results are
good enough. For the most part, there's only limited overlap between the people who work on better results and the people who work on more efficient models; those research directions are driven by different needs and thus also tend to happen in different institutions.
As long as our proof-of-concept solutions don't yet solve the task appropriately - as long as the solution is weak and/or brittle and worse than what we need for the main practical applications - most of the research focus, and the research progress, will be on models that try to give better results. It makes sense to disregard the compute cost and other practical inconveniences when working on pushing the bleeding edge, trying to make previously impossible things possible.
However, once tasks are "solved" from the academic proof-of-concept perspective, the practical, applied work on model efficiency generally can achieve huge reductions in the computing power required. But that work happens elsewhere.
The concept of technology readiness level (https://en.wikipedia.org/wiki/Technology_readiness_level) is relevant here. For the NLP and CV technologies that are at TRL 3 or 4, efficiency does not really matter as long as the model fits in whatever computing clusters you can afford; efficiency becomes an issue mainly for widespread industry adoption, by the time the same tech reaches TRL 6 or so, and that work mostly gets done by different people in different organizations with different funding sources than the initial TRL 3 research.