
latchkey 2y ago

AMD MI300x is 163.4 TFLOPS in FP64 (peak, matrix).

33 of them comes to about 5.39 PFLOPS, and would also have 6,336 GB (about 6.3 TB) of memory.

I'll have way more than that in my next purchase order.

It is really fun to build a super computer.
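A quick check of the arithmetic in the post above. The 163.4 TFLOPS figure is the one quoted in the post; the 192 GB of HBM3 per MI300X is AMD's published capacity and is an assumption here, since the post does not state it.

```python
# Per-GPU figures: 163.4 TFLOPS is quoted in the post;
# 192 GB per GPU is an assumption taken from AMD's published spec.
MI300X_FP64_TFLOPS = 163.4
MI300X_MEMORY_GB = 192
GPU_COUNT = 33

total_pflops = GPU_COUNT * MI300X_FP64_TFLOPS / 1000
total_memory_gb = GPU_COUNT * MI300X_MEMORY_GB

print(f"{GPU_COUNT} GPUs: {total_pflops:.2f} PFLOPS peak FP64")
print(f"{GPU_COUNT} GPUs: {total_memory_gb:,} GB "
      f"({total_memory_gb / 1000:.3f} TB) of memory")
```

Under those assumptions the aggregate memory is in the thousands of gigabytes, i.e. single-digit terabytes.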

mk_stjames 2y ago

I'm an amateur, but I have code that I think could probably dispatch threads pretty efficiently on Cheyenne through its management system, simply because it's all distributed Xeons. If I can run it on my personal 80-core cluster, I could have gotten it to run on Cheyenne back then.

But hitting the roofline on those AMD GPGPUs? I'd probably get nowhere fucking close.

That is the thing that Cheyenne was built for. People doing CFD research with x86 code that was already nicely parallelized via OpenMPI or what have you.
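For readers unfamiliar with the term, "the roofline" is a simple performance bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows. The 163.4 TFLOPS figure is the one quoted in the thread; the 5.3 TB/s bandwidth is AMD's published peak for the MI300X and is an assumption here, and the sample intensities are illustrative only.

```python
PEAK_FP64_TFLOPS = 163.4   # figure quoted in the thread
MEM_BANDWIDTH_TBPS = 5.3   # assumption: AMD's published peak HBM3 bandwidth


def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(PEAK_FP64_TFLOPS, MEM_BANDWIDTH_TBPS * flops_per_byte)


# Intensity at which a kernel stops being memory-bound (the "ridge point").
ridge_flops_per_byte = PEAK_FP64_TFLOPS / MEM_BANDWIDTH_TBPS

for intensity in (0.25, 1.0, 10.0, 100.0):  # illustrative values only
    print(f"{intensity:>6} FLOP/byte -> {attainable_tflops(intensity):.2f} TFLOPS")
print(f"ridge point: {ridge_flops_per_byte:.1f} FLOP/byte")
```

Under these assumptions a kernel needs roughly 30 FLOPs per byte moved before peak compute, rather than memory bandwidth, becomes the limit.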

latchkey (OP) 2y ago

It is wild how much compute has grown.

I put dual Epyc 9754 into my first box of MI300x.

That's 256 cores + 8x MI300x, in a single box.

Agreed, it is a great solution for CFD, which is definitely one workload I'd love to host.

dekhn 2y ago

I used to build small clusters and use supercomputers and I can't imagine it's fun to build a super computer. It requires a massive infrastructure and significant employee base, and individual component failures can take down entire jobs. Finding enough jobs to keep the system loaded 24/7 while also keeping the interconnect (which was 15-20% of the total system cost) busy, and finding the folks who can write such jobs, is not easy. Even then, other systems will be constantly nipping at your heels with newer/cheaper/smaller/faster/cooler hardware.

latchkey (OP) 2y ago

Thanks for the feedback. You make a lot of good points. I've built a 150,000 GPU system previously, but it was lower end hardware. It was a lot of fun to make it run smoothly, and it came with its own challenges.

It doesn't take a lot of employees. We did the above with essentially two technical people. Those same two are working on this business.

Finding workloads/jobs is definitely going to be an interesting adventure. That said, the need for compute isn't going away. By offering hard-to-get hardware at reasonable rates and contract lengths, I believe we are in a good position on that front, but time will tell.

We are only buying the best of the best that we can get today. The plan is to continuously cycle out older hardware, and to not pick one side over another. This should help us keep pace with other systems.

dekhn 2y ago

150K GPU with two people... presumably, at 8 GPU/host, you had close to 20K servers.

I can't really see how that's achievable with only two people, given the time to install hardware, maintain it, deal with outages and planned maintenance and testing, etc. Note: I worked at Google and interfaced with hwops so I have some real-world experience to compare to.

Building a 150K GPU system without a well-understood customer base seems a bit crazy to me. You will either become a hyperscaler, serve a niche, or go out of business, I fear.

latchkey (OP) 2y ago

7 separate data centers all around the US.

12 GPU/host. 130,000 of that kind. ~10,833 hosts.

The ASRock BC-250s we deployed each held 12 individual blades, and those were all PXE booted. We deployed 20,000 of those blades across 2 data centers. This was a massive feat of engineering, especially during covid, when I couldn't even access the machines directly. Built a whole dashboard to monitor it all too.
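The host counts traded back and forth above can be reproduced in a few lines. This sketch assumes each BC-250 blade counts as one GPU toward the 150,000 total, which the comment implies but does not state outright.

```python
TOTAL_GPUS = 150_000

# dekhn's estimate: 8 GPUs per host across the whole fleet.
hosts_at_8_per_host = TOTAL_GPUS // 8

# latchkey's breakdown: 130,000 GPUs in 12-GPU hosts, plus 20,000 blades.
GPUS_IN_12_GPU_HOSTS = 130_000
BC250_BLADES = 20_000  # assumption: one GPU per blade

hosts_at_12_per_host = GPUS_IN_12_GPU_HOSTS // 12
leftover_gpus = GPUS_IN_12_GPU_HOSTS % 12
fleet_total = GPUS_IN_12_GPU_HOSTS + BC250_BLADES

print(f"8 GPU/host estimate: {hosts_at_8_per_host:,} hosts")
print(f"12 GPU/host: {hosts_at_12_per_host:,} full hosts, {leftover_gpus} GPUs left over")
print(f"fleet total: {fleet_total:,} GPUs")
```

The 130,000 figure does not divide evenly by 12, which is why the host count is given as approximate.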

I know, I can't believe we did it either, but we did. Software automation was king. I built a single binary that ran on each individual host and knew how to self-configure and optimize everything. Idempotently. Even distributing upgrades to the binary was a neat challenge that I solved perfectly, in very creative ways.
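To illustrate what "idempotently" means for a self-configuring agent, here is a minimal sketch. It is not the binary described above: the function names, file names, and settings are all hypothetical. The idea is that each step compares the current state to the desired state and only acts on a difference, so running it twice changes nothing the second time.

```python
import os
import tempfile
from pathlib import Path


def ensure_file(path: Path, content: str) -> bool:
    """Make `path` hold exactly `content`. Return True only if a write happened."""
    if path.exists() and path.read_text() == content:
        return False  # already in the desired state, nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling temp file, then rename over the target,
    # so an interrupted run never leaves a half-written file behind.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(content)
    os.replace(tmp, path)
    return True


def converge(root: Path, desired: dict[str, str]) -> list[str]:
    """Apply every desired file under `root`; return the paths that changed."""
    return [rel for rel, content in desired.items()
            if ensure_file(root / rel, content)]


# Hypothetical desired state for one host (made-up names and values).
DESIRED = {
    "etc/worker.conf": "gpus=12\nprofile=default\n",
    "etc/hostname": "node-0001\n",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    first_run = converge(root, DESIRED)
    second_run = converge(root, DESIRED)
    print("first run changed:", first_run)
    print("second run changed:", second_run)
```

Because a repeat run is a no-op, the same routine can be triggered on every boot or on a timer without tracking what was applied before.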

Today, we are starting much smaller. Literally from zero/scratch. Given the cost of MI300x, I doubt we will ever get to 150k GPUs. That's an absurd amount of money, but who knows.

