
KeplerBoy1y ago

The PCIe configuration was taken from the Mac Pro and its M2 Ultra. https://www.apple.com/mac-pro/

I'd assume the Mac mini has a less extensive PCIe/TB subsystem.

No idea what people are doing with all those PCIe slots except for NVMe cards. I wonder how hard it would be to talk to a PCIe FPGA.


morphle1y ago

You use SerDes high-speed serial links (up to 224 Gbps in 2025) to communicate between chips. A PCIe lane is just a SerDes with a 30% packet protocol overhead that uses DMA to copy bytes between two SRAM or DRAM buffers.

You aggregate PCIe lanes (x16, x8, x4/Thunderbolt, x1). You could also build mesh networks from SerDes, but then instead of PCIe switches you would need SerDes switches or routers (Ethernet, NVLink, InfiniBand).
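To put rough numbers on the lane aggregation above, here is a minimal Python sketch. The per-lane rates and 128b/130b encoding are the published PCIe Gen3 to Gen5 figures; the 30% protocol overhead is simply the figure quoted in this comment, taken as an assumption rather than a measurement. The function name and table are made up for illustration.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per PCIe generation.
GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}

# Assumption: the 30% packet protocol overhead quoted in the comment above.
PROTOCOL_OVERHEAD = 0.30


def link_bandwidth_gbytes(gen: int, lanes: int,
                          overhead: float = PROTOCOL_OVERHEAD) -> float:
    """Estimate usable one-direction bandwidth of a PCIe link in GB/s."""
    rate_gt, encoding = GENERATIONS[gen]
    raw_gbits = rate_gt * encoding * lanes       # Gbit/s after line encoding
    return raw_gbits * (1.0 - overhead) / 8.0    # GB/s after protocol overhead


if __name__ == "__main__":
    for lanes in (1, 4, 8, 16):
        print(f"Gen4 x{lanes}: {link_bandwidth_gbytes(4, lanes):.2f} GB/s")
```

Bandwidth scales linearly with lane count in this model, which is the whole point of aggregating lanes; real links also lose some throughput to flow control and acknowledgement traffic that is not modelled here.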

You need those high-speed links between chips for much more than SSD/NVMe cards: other NAS, processors, Ethernet/internet, camera, Wi-Fi, optics, DRAM, SRAM, power, etc. You need them for inter-core communication (between processors or between chiplets), between networked PCBs, between DRAM chips (DDR5 is just another SerDes protocol), flash chips, camera chips, and so on. Basically any other chip at speeds faster than 250 Mbps.

I aggregate all the M4 Mac mini ports into an M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap low-power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.

My talk [1] on Wafer Scale Integration and free space optics goes deeper into how and why SerDes and PCIe will be replaced by fiber optics and free space optics for power reasons. I'm sure several parallel 2 GHz optic lambdas per fiber (but no SerDes!) will be the next step in Apple Silicon as well: the M4 power budget is already mostly in the off-chip SerDes/Thunderbolt networking links.

[1] https://vimeo.com/731037615

KeplerBoyOP1y ago

> I aggregate all the M4 Mac mini ports into an M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap low-power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.

That sounds super interesting, do you happen to have some further information on that? Is it just a bunch of FPGAs issuing DMA TLPs?

ricktdotorg1y ago

sounds (at least at a high level) similar to EXO[1]

[1] https://github.com/exo-explore/exo

morphle1y ago

Here is a video of testing Exo to run huge LLMs on a cluster of M4 Macs [1] more cheaply than with a cluster of NVIDIA RTX 4090s.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho


morphle1y ago

It is not the first time supercomputers have been built from off-the-shelf Apple machines [1].

M4 supercomputers are cheaper, and they will also have lower Capex and Opex than most datacenter hardware.

>do you happen to have some further information on that?

Yes, the information is in my highly detailed custom documentation for the programmers and buyers of 'my' Apple Silicon supercomputer, its Squeak and OMeta DSL programming languages and adaptive compiler. You can contact me for this highly technical report and several scientific papers (email in my profile).

Do you know of people who might buy a super computer based on better specifications? Or even just buyers who will go for 'the lowest Capex and the lowest Opex supercomputer in 2025-2027'?

Because the problem with HPC is that almost all funders and managers buy supercomputers with a safe brand name (Nvidia, AMD, Intel) at triple the cost and seldom from a supercomputer researcher like myself. But some do, if they understand why. I have been designing, selling, programming and operating supercomputers since 1984 (I was 20 years old then); this M4 Apple Silicon cluster will be my ninth supercomputer. I prefer to build them from the ground up with our own chip and wafer scale integration designs, but when an off-the-shelf chip is good enough I'll sell that instead. Price/Performance/Watt is what counts; ease of programming is a secondary consideration for what performance you achieve. Alan Kay argues you should rewrite your software from scratch [2] and do your own hardware [3], so that is what I've done since I learned from him.

>Is it just a bunch of FPGAs issuing DMA TLPs?

No. The FPGAs are optional, for when you want to flatten the inter-core (= inter-SRAM cache) networking with switches or routers to a shorter hop topology for the message passing, like a Slim Fly diameter-two topology [4].
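To illustrate what "diameter two" buys you, here is a small Python sketch. It does not build an actual Slim Fly network (those use a more involved graph construction); as a stand-in it uses the Petersen graph, a well-known 10-node, degree-3 graph with diameter 2, and compares it with a plain 10-node ring. All function names are made up for this example.

```python
from collections import deque


def petersen_graph() -> dict[int, set[int]]:
    """10-node, degree-3 graph with diameter 2 (stand-in, not Slim Fly itself)."""
    adj: dict[int, set[int]] = {n: set() for n in range(10)}

    def link(a: int, b: int) -> None:
        adj[a].add(b)
        adj[b].add(a)

    for i in range(5):
        link(i, (i + 1) % 5)          # outer ring
        link(i, i + 5)                # spoke to the inner node
        link(i + 5, (i + 2) % 5 + 5)  # inner pentagram
    return adj


def ring_graph(n: int) -> dict[int, set[int]]:
    """Simple ring of n nodes, for comparison."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def diameter(adj: dict[int, set[int]]) -> int:
    """Largest shortest-path hop count between any two nodes (BFS from each node)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        todo = deque([start])
        while todo:
            node = todo.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    todo.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    print("ring diameter:", diameter(ring_graph(10)))
    print("petersen diameter:", diameter(petersen_graph()))
```

With only one extra link per node, the worst-case hop count drops from five to two, which is the kind of flattening the switches or routers are meant to provide.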

DMA (Direct Memory Access) TLPs (Transaction Layer Packets) are one of the worst ways of doing inter-core and inter-SRAM communication, and on PCIe they carry a huge 30% protocol overhead at triple the cost. Intel (and most other chip companies like NVIDIA, Altera, AMD/Xilinx) can't design proper chips because they don't want to learn about software [2]. Apple Silicon is marginally better.
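For a feel of where a per-packet overhead figure like this can come from, here is a rough Python sketch of TLP payload efficiency. The byte counts (16-byte header, 2-byte sequence number, 4-byte LCRC, 2 bytes of framing) are assumed defaults in the style of commonly cited Gen1/Gen2 numbers, not values taken from this thread, and the sketch ignores line encoding, acknowledgements and flow-control traffic.

```python
def tlp_efficiency(payload_bytes: int,
                   header_bytes: int = 16,   # assumed 4-DW header
                   seq_bytes: int = 2,       # assumed sequence number
                   lcrc_bytes: int = 4,      # assumed link CRC
                   framing_bytes: int = 2) -> float:  # assumed start/end framing
    """Fraction of bytes on the wire that are payload, for one TLP."""
    overhead = header_bytes + seq_bytes + lcrc_bytes + framing_bytes
    return payload_bytes / (payload_bytes + overhead)


if __name__ == "__main__":
    for payload in (64, 128, 256, 512):
        eff = tlp_efficiency(payload)
        print(f"{payload:4d} B payload: {eff:.1%} payload, {1 - eff:.1%} overhead")
```

Under these assumptions a 64-byte payload loses a bit over a quarter of the wire bytes to packet overhead, and the loss shrinks as payloads grow, so the effective overhead depends heavily on the transfer sizes in use.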

You should use pure message passing between any process, preferably in a programming language and a VM that uses pure message passing at the lowest level (Squeak, Erlang). Even better if you then map those software messages directly to message passing hardware as in my custom chips [3].
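As a toy illustration of the message-passing style, here is a Python sketch of an actor with a private mailbox. Python threads and `queue.Queue` are only a stand-in for what Squeak or Erlang give you natively, and the `Actor` class and handler are invented for this example; nothing here reflects the custom chips or VM mentioned above.

```python
import queue
import threading

_STOP = object()  # sentinel telling an actor to shut down


class Actor:
    """Owns its state and a mailbox; the outside world only sends it messages."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message) -> None:
        self._mailbox.put(message)

    def stop(self) -> None:
        """Ask the actor to finish queued messages, then wait for it to exit."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self._handler(message)


def square_and_reply(message) -> None:
    """Handler: message is (value, reply_queue); reply with the square."""
    value, reply_to = message
    reply_to.put(value * value)


def demo() -> list:
    replies = queue.Queue()
    worker = Actor(square_and_reply)
    for n in (1, 2, 3):
        worker.send((n, replies))
    worker.stop()
    return [replies.get() for _ in range(3)]


if __name__ == "__main__":
    print(demo())
```

No memory is shared between sender and worker except the queues themselves; requests and replies both travel as messages, which is the property that would let such messages map onto hardware links.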

The reason to reverse engineer the Apple Silicon instructions for CPU, GPU and ANE is to be able to adapt my adaptive compiler to M4 chips, but also to repurpose PCIe for low-level message passing with much better performance and latency than DMA TLPs.

To conclude, if you want to get the cheapest Capex and Opex M4 Mac mini supercomputer, you need to rewrite your supercomputing software in a high-level language and message passing system like the parallel Squeak Smalltalk VM [3] with adaptive load balancing compilation. C, C++, Swift, MPI or CUDA would result in sub-optimal software performance and orders of magnitude more lines of code when optimal performance of parallel software is the goal.

[1] https://en.wikipedia.org/wiki/System_X_(supercomputer)

[2] https://www.youtube.com/watch?v=ubaX1Smg6pY

[3] https://vimeo.com/731037615

[4] https://www.youtube.com/watch?v=rLjMrIWHsxs

morphle1y ago

I forgot to add a link to a talk [5] by IBM Research on massively parallel Squeak Smalltalk and why it might be relevant for Apple Silicon reverse engineering and M4 clusters.

Talk [6] on free space optical interconnects without SerDes some day showing up on low power Apple Silicon (around M6-M8 models).

[5] https://www.youtube.com/watch?v=GBtqQwcJoN0

[6] https://www.youtube.com/watch?v=-dQoImLNgWs
