I'm talking about Netflix CDN servers. The workload is primarily file serving. The twist is that we use a non-NUMA aware OS (FreeBSD).
We're not latency sensitive at all. The problem we run into with NUMA is that we totally saturate QPI due to FreeBSD's lack of NUMA awareness.
The results you link to don't match with what we've seen on our HCC Broadwell CPUs, at least with COD disabled. Though we only really look at aggregate system bandwidth, so potentially the slowness accessing the "far" memory on the same socket is latency driven, and falls away in aggregate.