Today, on a 24/48-core dual-socket server, you'll see the same sort of thing: a core using memory attached to the 'other' physical chip's memory bus will take a significant performance hit.
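You can see this yourself on Linux with libnuma. Here's a minimal sketch (not a rigorous benchmark) that pins execution to one NUMA node and touches memory allocated locally vs. on the other socket; the node numbers, buffer size, and stride are illustrative, and it assumes a machine with at least two NUMA nodes:

```c
/* Local vs. remote NUMA access sketch. Linux only; compile with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (512UL * 1024 * 1024)  /* 512 MiB working set (illustrative) */

static double touch(volatile char *buf) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SIZE; i += 64)  /* stride one cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       /* pin execution to node 0 */
    char *local  = numa_alloc_onnode(SIZE, 0); /* memory on our socket   */
    char *remote = numa_alloc_onnode(SIZE, 1); /* memory on the other one */
    printf("local:  %.3f s\n", touch(local));
    printf("remote: %.3f s\n", touch(remote));
    numa_free(local, SIZE);
    numa_free(remote, SIZE);
    return 0;
}
```

The remote pass has to cross the inter-socket link on every cache-line fill, which is where the slowdown comes from.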
In the 2000s, before multi-core chips were a thing, there were two major camps: the 'super computer' camp and the 'cluster' camp.
The 'super computer' camp insisted on cache-coherent memory between all of the cores or threads. You got these very expensive fabrics from companies like Cray that would snoop memory accesses from the cores and send coherency messages around to ensure that if someone wrote something into memory somewhere, everyone else's L1 or L2 cache got the message to invalidate what it was holding (shoot-downs). These machines were very expensive and took months to build.
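You can still feel the cost of those invalidation shoot-downs on any multi-core box today. Here's a hedged little demo (parameters and padding are illustrative, not a proper benchmark): two threads hammering counters that share one cache line force the line to ping-pong between cores, while padding the counters onto separate lines avoids the coherency traffic:

```c
/* False-sharing sketch: coherency invalidations in action.
 * Compile with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL  /* illustrative iteration count */

struct { volatile long a, b; } same;                               /* one cache line   */
struct { volatile long a; char pad[64]; volatile long b; } split;  /* separate lines   */

static void *same_a(void *p)  { for (unsigned long i = 0; i < ITERS; i++) same.a++;  return p; }
static void *same_b(void *p)  { for (unsigned long i = 0; i < ITERS; i++) same.b++;  return p; }
static void *split_a(void *p) { for (unsigned long i = 0; i < ITERS; i++) split.a++; return p; }
static void *split_b(void *p) { for (unsigned long i = 0; i < ITERS; i++) split.b++; return p; }

static double run(void *(*fa)(void *), void *(*fb)(void *)) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, fa, NULL);
    pthread_create(&tb, NULL, fb, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line: %.2f s\n", run(same_a, same_b));
    printf("separate lines:  %.2f s\n", run(split_a, split_b));
    return 0;
}
```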
The 'cluster' camp said, "We can use a network fabric and just parameterize shared memory usage." So they put together independent machines connected by a network fabric, with no cache-coherency protocol. If you wanted shared memory, you could build something like memcached and wrap your accesses in network calls. With that architecture, even if it took twice as many cores to do what you wanted to do, the price of the machine was one tenth of what it was for the big SMP machine.
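To make that concrete, here's a rough sketch of what "wrap your accesses in network calls" looks like: instead of `shared = 42` and reading it back, every access becomes a round trip to a memcached-style server over its plain-text protocol. The host, port, and key are illustrative, and error handling is minimal:

```c
/* Shared state via network calls instead of coherent memory (sketch). */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(11211) }; /* default memcached port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    /* A "write to shared memory" becomes: set <key> <flags> <exptime> <bytes> */
    const char *set = "set counter 0 0 2\r\n42\r\n";
    write(fd, set, strlen(set));
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf - 1);   /* expect "STORED\r\n" */
    if (n > 0) { buf[n] = 0; printf("set -> %s", buf); }

    /* A "read from shared memory" becomes: get <key> */
    const char *get = "get counter\r\n";
    write(fd, get, strlen(get));
    n = read(fd, buf, sizeof buf - 1);           /* expect "VALUE ... END" */
    if (n > 0) { buf[n] = 0; printf("get -> %s", buf); }

    close(fd);
    return 0;
}
```

Each access costs a network round trip instead of a cache fill, which is why this only wins when the problem doesn't need fine-grained sharing in the first place.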
For something trivially parallelizable, like internet search or serving up web sites, that was a much more cost-effective way to go. When people started using these Linux clusters for work they had previously done on 'super computers' and big SMP machines, it became a sort of race to pull those problems apart into "shared nothing" clusters.