As you can see from this comment thread, most people, especially programmers, lack the knowledge that we computer scientists, parallel programmers and chip or hardware designers have.
>What is your process
Science. To measure is to know, my prof always said.
To answer your questions in detail, email me.
You first need to be specific. The problem is not how to turn Mac minis into a cluster, with or without custom hardware (I do both), on code X or Y. Nor is it how to optimize software or rewrite it from scratch (which is often cheaper).
First find the problem. In this case the problem is to find the lowest OPEX and CAPEX for the stated compute load, versus changing the compute load. In a simulation, or in a cruder spreadsheet calculation, it becomes clear that the energy cost dominates the hardware choice: it trumps the cost of programming, the cost of off-the-shelf hardware and the difference you get if you add custom hardware. M4s are lower power, lower OPEX and lower CAPEX, especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.
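The cruder spreadsheet calculation could be sketched roughly like this. It is a minimal sketch, assuming the load runs around the clock at a constant average power draw; the `Option` fields, function names and all numbers in the usage part are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class Option:
    """One candidate setup. Every field is an input you measure or get quoted."""
    name: str
    hardware_usd: float   # CAPEX: purchase price of all machines
    labour_usd: float     # one-time programming or rewrite cost
    avg_watts: float      # average power draw under the stated compute load


def energy_usd(opt: Option, years: float, usd_per_kwh: float) -> float:
    """Energy OPEX over the period, assuming the load runs 24/7."""
    kwh = opt.avg_watts / 1000 * HOURS_PER_YEAR * years
    return kwh * usd_per_kwh


def total_usd(opt: Option, years: float, usd_per_kwh: float) -> float:
    """Hardware plus labour plus energy over the period."""
    return opt.hardware_usd + opt.labour_usd + energy_usd(opt, years, usd_per_kwh)


def cheapest(options: list[Option], years: float, usd_per_kwh: float) -> Option:
    """Pick the option with the lowest total cost over the period."""
    return min(options, key=lambda o: total_usd(o, years, usd_per_kwh))


if __name__ == "__main__":
    # Placeholder numbers only; replace them with your own measurements and quotes.
    options = [
        Option("existing GPU setup", hardware_usd=100_000, labour_usd=0, avg_watts=10_000),
        Option("rewritten, low-power cluster", hardware_usd=60_000, labour_usd=40_000, avg_watts=3_000),
    ]
    for o in options:
        print(o.name, round(energy_usd(o, 5, 0.30)), round(total_usd(o, 5, 0.30)))
    print("cheapest:", cheapest(options, 5, 0.30).name)
```

Cooling overhead, depreciation and idle time are left out on purpose; add them as extra terms once you have measured them.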
You can repurpose the 2 x 10 Gbps USB-C ports, the 10 Gbps Ethernet port and the three 32 Gbps PCIe or Thunderbolt ports, but you have to use better drivers. You need to weigh whether doubling the 960 Gbps, 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost, and whether 3 x 4 x 32 Gbps PCIe 4.0 or 3 x 120 Gbps unidirectional is better for this particular algorithm. You also need to check what changes if you use the 10 CPU cores, the 10 x 400 GPU cores and the 16 Neural Engine cores (at 38 trillion 16-bit OPS) together, and whether that works better than just the CUDA cores.
Usually the answer is: rewrite the algorithm and use an adaptive compiler, and then a cluster of smaller 'sweet spot' off-the-shelf hardware will outperform the fanciest high-end hardware, provided the network is balanced. This varies at runtime, so you'll only know if you know how to code.
As Alan Kay said and Steve Jobs quoted: if you're serious about software, you should do your own hardware. If you can't, then you can approach the hardware with commodity components if that turns out to be cheaper. I estimate that for $42K of labour I can save you a few hundred $k.
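A first pass at the "is the network balanced" question can be done on paper before measuring. The sketch below is a crude model, assuming memory traffic and network traffic overlap perfectly and ignoring latency and compute time; the function names and the per-step byte counts are hypothetical, and only the 960 Gbps and 10 Gbps link speeds come from the figures above.

```python
def transfer_seconds(n_bytes: float, link_gbps: float) -> float:
    """Time to move n_bytes over a link of link_gbps gigabits per second."""
    return n_bytes * 8 / (link_gbps * 1e9)


def step_time(mem_bytes: float, net_bytes: float,
              mem_gbps: float, net_gbps: float) -> tuple[float, str]:
    """Time for one step and the name of the link that limits it.

    Assumes the two transfers overlap fully, so the slower one sets the pace.
    """
    mem_t = transfer_seconds(mem_bytes, mem_gbps)
    net_t = transfer_seconds(net_bytes, net_gbps)
    if mem_t >= net_t:
        return mem_t, "memory"
    return net_t, "network"


def balanced_ratio(mem_gbps: float, net_gbps: float) -> float:
    """Bytes of local memory traffic per byte of network traffic at which
    neither link waits for the other."""
    return mem_gbps / net_gbps


if __name__ == "__main__":
    # Hypothetical workload: 12 GB of memory traffic and 1.25 GB sent to
    # neighbours per step, on 960 Gbps memory and 10 Gbps Ethernet.
    seconds, limit = step_time(12e9, 1.25e9, mem_gbps=960, net_gbps=10)
    print(seconds, limit)
    print(balanced_ratio(960, 10))
```

If the algorithm's actual ratio of local to remote traffic is far below `balanced_ratio`, the network is the bottleneck and a faster link or a rewritten algorithm matters more than faster memory. Since the ratio varies at runtime, treat this only as a starting point for measurement.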