Today, on a 24/48-core dual-socket server, you'll see the same sort of thing: a core using memory attached to the 'other' physical chip's memory bus will take a significant performance hit.
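You can see this yourself on Linux with libnuma. Here's a minimal sketch (not a rigorous benchmark) that pins execution to one NUMA node and touches memory allocated locally vs. on the other socket; the node numbers, buffer size, and stride are illustrative, and it assumes a machine with at least two NUMA nodes:

```c
/* Local vs. remote NUMA access sketch. Linux only; compile with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (512UL * 1024 * 1024)  /* 512 MiB working set (illustrative) */

static double touch(volatile char *buf) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SIZE; i += 64)  /* stride one cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       /* pin execution to node 0 */
    char *local  = numa_alloc_onnode(SIZE, 0); /* memory on our socket   */
    char *remote = numa_alloc_onnode(SIZE, 1); /* memory on the other one */
    printf("local:  %.3f s\n", touch(local));
    printf("remote: %.3f s\n", touch(remote));
    numa_free(local, SIZE);
    numa_free(remote, SIZE);
    return 0;
}
```

The remote pass has to cross the inter-socket link on every cache-line fill, which is where the slowdown comes from.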
In the 2000s, before multi-core chips were a thing, there were two major camps: the 'super computer' camp and the 'cluster' camp.
The 'super computer' camp insisted on cache-coherent memory between all of the cores or threads. You got these very expensive fabrics from companies like Cray that would snoop memory accesses from the cores and send coherency messages around to ensure that if someone wrote something into memory somewhere, everyone else's L1 or L2 cache got the message to invalidate what it was holding (shoot-downs). These machines were very expensive and took months to build.
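You can still feel the cost of those invalidation shoot-downs on any multi-core box today. Here's a hedged little demo (parameters and padding are illustrative, not a proper benchmark): two threads hammering counters that share one cache line force the line to ping-pong between cores, while padding the counters onto separate lines avoids the coherency traffic:

```c
/* False-sharing sketch: coherency invalidations in action.
 * Compile with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL  /* illustrative iteration count */

struct { volatile long a, b; } same;                               /* one cache line   */
struct { volatile long a; char pad[64]; volatile long b; } split;  /* separate lines   */

static void *same_a(void *p)  { for (unsigned long i = 0; i < ITERS; i++) same.a++;  return p; }
static void *same_b(void *p)  { for (unsigned long i = 0; i < ITERS; i++) same.b++;  return p; }
static void *split_a(void *p) { for (unsigned long i = 0; i < ITERS; i++) split.a++; return p; }
static void *split_b(void *p) { for (unsigned long i = 0; i < ITERS; i++) split.b++; return p; }

static double run(void *(*fa)(void *), void *(*fb)(void *)) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, fa, NULL);
    pthread_create(&tb, NULL, fb, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line: %.2f s\n", run(same_a, same_b));
    printf("separate lines:  %.2f s\n", run(split_a, split_b));
    return 0;
}
```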
The 'cluster' camp said, "We can use a network fabric and just parameterize shared memory usage." So they put together independent machines connected by a network fabric, with no cache-coherency protocol. If you wanted shared memory, you could build something like memcached and wrap your accesses in network calls. With that architecture, even if it took twice as many cores to do what you wanted to do, the price of the machine was one tenth of what it was for the big SMP machine.
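To make that concrete, here's a rough sketch of what "wrap your accesses in network calls" looks like: instead of `shared = 42` and reading it back, every access becomes a round trip to a memcached-style server over its plain-text protocol. The host, port, and key are illustrative, and error handling is minimal:

```c
/* Shared state via network calls instead of coherent memory (sketch). */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(11211) }; /* default memcached port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    /* A "write to shared memory" becomes: set <key> <flags> <exptime> <bytes> */
    const char *set = "set counter 0 0 2\r\n42\r\n";
    write(fd, set, strlen(set));
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf - 1);   /* expect "STORED\r\n" */
    if (n > 0) { buf[n] = 0; printf("set -> %s", buf); }

    /* A "read from shared memory" becomes: get <key> */
    const char *get = "get counter\r\n";
    write(fd, get, strlen(get));
    n = read(fd, buf, sizeof buf - 1);           /* expect "VALUE ... END" */
    if (n > 0) { buf[n] = 0; printf("get -> %s", buf); }

    close(fd);
    return 0;
}
```

Each access costs a network round trip instead of a cache fill, which is why this only wins when the problem doesn't need fine-grained sharing in the first place.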
For something trivially parallelizable, like internet search or serving up web sites, that was a much more cost-effective way to go. When people started using these Linux clusters for work they had previously done on 'super computers' and big SMP machines, it became a sort of race to pull those problems apart into "shared nothing" clusters.