So here's a typical problem that I'd have in an HPC application. I'd have some space, represented by some 3D structure (maybe an array, or even an object-based particle system). I need to do some computation over this space -- often using some type of stencil, so in order to compute the value at <x,y,z> I need the values at coordinates some distance away from <x,y,z>.
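To make the stencil idea concrete, here's a minimal sketch in Python: a Jacobi-style 7-point stencil over a 3D grid. The grid shape, the averaging rule, and the function names are my own illustrative choices, not anything prescribed above.

```python
def make_grid(nx, ny, nz, value=0.0):
    """Allocate an nx x ny x nz grid as nested lists."""
    return [[[value for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]

def stencil_step(grid, nx, ny, nz):
    """One update sweep: each interior point becomes the average of itself
    and its six face neighbors -- computing <x,y,z> needs values at
    coordinates one step away in every direction."""
    new = make_grid(nx, ny, nz)
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                new[x][y][z] = (grid[x][y][z]
                                + grid[x - 1][y][z] + grid[x + 1][y][z]
                                + grid[x][y - 1][z] + grid[x][y + 1][z]
                                + grid[x][y][z - 1] + grid[x][y][z + 1]) / 7.0
    return new
```

Once this space is split across processors, the neighbor reads at the edges of each piece are exactly the values that have to come from another processor.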
The part that often ends up being tricky is the fact that I need to send data from processor A to processor B, and I want to send as little data as possible. So one of the first sources of bugs is that when I do my gather-scatter I make a mistake mapping a value to a coordinate. In shared memory you never have to do this mapping back and forth, so it's not an issue.
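That mapping is just an index calculation, but it has to agree on both ends. A sketch of the coordinate-to-buffer mapping, assuming a row-major packing order (the convention here is my assumption; silently using a different order on the receiving side is exactly the bug described above):

```python
def pack_index(x, y, z, ny, nz):
    """Map a 3D coordinate to its slot in a flat send buffer (row-major)."""
    return (x * ny + y) * nz + z

def unpack_index(i, ny, nz):
    """Invert pack_index: recover (x, y, z) from a buffer slot."""
    x, rem = divmod(i, ny * nz)
    y, z = divmod(rem, nz)
    return (x, y, z)
```

The receiver must call `unpack_index` with the *sender's* `ny` and `nz`; passing its own local dimensions instead is the classic way this goes wrong.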
The next issue is related to the fact that I don't want to ever block waiting for data. There are a variety of models for handling this. I can post a non-blocking receive and do some work while waiting for the data to arrive. This is often another source of bugs, as people will do work that depends on the new data -- they chug along without it, add the new data when it arrives, and alas their computation is already hosed.
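The safe pattern is: post the receive, do only work that is *independent* of the incoming data, then wait, and only then touch anything that depends on it. A sketch of that ordering, using a background thread as a stand-in for a non-blocking receive (in MPI terms this is the Irecv/Wait pattern; the payload and function names are made up for illustration):

```python
import threading
import queue

def irecv(q):
    """Stand-in for a non-blocking receive: data arrives asynchronously."""
    t = threading.Thread(target=lambda: q.put([1.0, 2.0, 3.0]))  # hypothetical halo data
    t.start()
    return t

def wait(req, q):
    """Stand-in for MPI_Wait: block until the data has really arrived."""
    req.join()
    return q.get()

def compute_interior():
    """Work that needs no remote data -- safe to overlap with the receive."""
    return sum(i * i for i in range(100))

q = queue.Queue()
req = irecv(q)
interior = compute_interior()   # overlap: independent work only
halo = wait(req, q)             # the data is guaranteed present after this
boundary = sum(halo)            # dependent work comes strictly after the wait
```

The bug described above is moving that last line (or anything like it) ahead of the `wait` call: the code may happen to work when the message arrives quickly, then fail unpredictably under load.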
And the last common error in this case is handing the data off to the wrong object (or processor), or being confused about which data you're receiving at any given point in time.
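One common defense is to label every message so the receiver can route it to the right consumer -- in MPI this is the role of the (source, tag) pair on a receive. A sketch, with tag values and handler wiring invented for illustration:

```python
# Hypothetical tags distinguishing two kinds of incoming data.
HALO_TAG, RESIDUAL_TAG = 0, 1

def dispatch(messages, handlers):
    """Route each (tag, payload) message to the handler registered for its
    tag, instead of assuming messages arrive in a particular order."""
    for tag, payload in messages:
        handlers[tag](payload)

halos, residuals = [], []
handlers = {HALO_TAG: halos.append, RESIDUAL_TAG: residuals.append}
dispatch([(HALO_TAG, "h0"), (RESIDUAL_TAG, "r0"), (HALO_TAG, "h1")], handlers)
```

Code that instead assumes "the next message must be my halo data" works until two message types interleave, and then hands the wrong data to the wrong object.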
Now all of these can be handled simply by being careful and using good programming practices. But they are simple, if not grossly naive, examples of issues you have with traditional message passing that don't exist in shared memory.