"For example, bigtable benefits from cache sharing and would prefer 100% remote accesses to 50% remote. Search-frontend prefers spreading the threads to multiple caches to reduce cache contention and thus also prefers 100% remote accesses to 50% remote."
Let me see if I've got this straight:
* bigtable benefits from scheduling related threads on the same cpu so they can share a cache, I'm guessing because multiple threads work on the same data simultaneously
* search benefits from having its threads spread over many cpus, probably because the threads are unrelated to each other and not sharing data, so they like to have their own caches
I'm not sure I understand how this relates to NUMA, or why remote accesses are ever a good thing. Maybe it requires a more sophisticated understanding of computer architecture than what I have.
The NUMA bit comes in when you said "scheduling related threads on the same cpu" and "threads spread over many cpus". If you schedule related threads on the same socket (cpu), you're more likely to get local memory accesses. If your threads share data, that's two good things at once: local memory accesses and good cache usage. But if your threads use different data, the local accesses may not matter, because the threads interfere with each other in the shared cache and you take many more misses overall.
A simpler way to think about it: shorter access to main memory does not help you if you end up doing many more total accesses.
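A back-of-envelope model makes that concrete. All the numbers below (latencies, miss rates, the ~1.6x remote/local ratio) are illustrative guesses, not figures from the article:

```python
# Toy model of average memory-access cost:
#   total = accesses * (hit_rate * cache_latency + miss_rate * dram_latency)
# Latencies in nanoseconds are rough, illustrative values.
CACHE_NS = 4      # last-level cache hit
LOCAL_NS = 100    # DRAM on the local NUMA node
REMOTE_NS = 160   # DRAM on a remote node (~1.6x local is a common rough ratio)

def total_ns(accesses, miss_rate, dram_ns):
    return accesses * ((1 - miss_rate) * CACHE_NS + miss_rate * dram_ns)

# Packed onto one socket: every miss is local, but the threads thrash
# the shared cache, so misses are frequent.
packed = total_ns(1_000_000, miss_rate=0.20, dram_ns=LOCAL_NS)

# Spread across sockets: every miss goes remote, but each thread gets
# more cache to itself, so misses are rare.
spread = total_ns(1_000_000, miss_rate=0.05, dram_ns=REMOTE_NS)

print(packed, spread)   # 23200000.0 11800000.0
```

With these (made-up) numbers, 100% remote with a 5% miss rate beats 100% local with a 20% miss rate by roughly 2x, which is exactly the "100% remote is better than 50% remote" shape of the quote.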
I'm pretty sure it goes without saying that 100% local is always better, assuming you're not trading anything else away (like accessible CPU on other nodes).
I sense that the article is saying things in confusing ways, perhaps because that's the way computer architects speak (it always struck me as counterintuitive and confusing to measure a cache by its miss rate rather than its hit rate) or maybe it's this article.
No technology is a 'silver bullet'. Every workload has a different set of considerations that require a different set of technology to optimize.
The way I read the outcome, NUMA seems to do what it's supposed to. The premise was that remote memory accesses are a performance killer and that forcing threads onto fewer cpus should be a big win. But the default NUMA behavior came out looking pretty good, and leaving it alone looks like an excellent policy. Consider that Google brought in a team of experts for the sole purpose of figuring out how to beat it.
NUMA was 15% better for Gmail and 20% better for the Web search frontends, as indicated by the reductions (improvements) in CPI for these workloads.
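If I'm reading those percentages as CPI reductions, the runtime impact is easy to work out: for a fixed instruction count and clock rate, runtime is proportional to CPI, so a fractional CPI reduction maps directly to a speedup:

```python
# Runtime ∝ instructions × CPI / clock_rate, so with instructions and
# clock fixed, cutting CPI by a fraction r gives a 1/(1-r) speedup.
for name, cpi_reduction in [("Gmail", 0.15), ("Web search frontend", 0.20)]:
    speedup = 1 / (1 - cpi_reduction)
    print(f"{name}: {speedup:.2f}x faster")
# Gmail: 1.18x faster
# Web search frontend: 1.25x faster
```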
There were some workloads where NUMA did degrade performance, such as BigTable accesses (12% regression).
I.e. the performance benefit from socket-local memory accesses may not be worth running all the threads that use that memory on that socket's CPUs, because each thread gets too small a share of the cache.
(Than SMP systems, I guess, but the OP does not say.)