* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.
* Model parallelism alone gives full performance; no need for data parallelism if you size the model to fit.
* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.
* Fully asynchronous/dataflow by default; only minimal synchronization is needed between forward and backward passes.
I genuinely don't know how you'd build a simpler system than this.
In particular, when the problem changes from optimally scheduling a single state machine to placing operations on a fixed routing grid (à la FPGA), it becomes radically different, and any looping control flow turns into an absolute nail-biter of an issue.
Having talked to someone at Cerebras, I also know that they don't just want to do inference with this, they want to accelerate training as well. That can involve much more complex control flow than you might think. Start reading about automatic differentiation and you will soon realize that it's complex enough to basically be its own subfield of compiler design. There have been multiple entire books written on the topic, and I can guarantee you there can be control-flow-driven optimizations in there (e.g., if x == 0, don't compute this large subgraph).
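To make that "skip the subgraph" point concrete, here's a toy tape-based reverse-mode AD sketch (my own illustration, nothing to do with Cerebras's actual stack): a data-dependent branch means the expensive subgraph is never recorded, so the backward pass never touches it either.

```python
def f(x, w):
    """Toy forward pass. If x == 0, the 'large subgraph' (here just x * w
    as a stand-in) is never executed, so no adjoint ops land on the tape."""
    tape = []
    if x == 0:
        y = 0.0                      # cheap path: constant, zero gradient
    else:
        y = x * w                    # stand-in for the expensive subgraph
        tape.append(("mul", x, w))
    return y, tape

def grad_w(tape):
    """Backward pass: replay the tape in reverse, accumulating dL/dw
    for seed dL/dy = 1. The skipped branch contributes nothing."""
    g = 0.0
    for op, x, w in reversed(tape):
        if op == "mul":
            g += x                   # d(x*w)/dw = x
    return g

y0, tape0 = f(0.0, 3.0)
print(y0, grad_w(tape0))   # 0.0 0.0 -- subgraph and its gradient both skipped
y1, tape1 = f(2.0, 3.0)
print(y1, grad_w(tape1))   # 6.0 2.0
```

The compiler-design headache is exactly this: the shape of the backward computation depends on runtime control flow in the forward pass, which is awkward on hardware where the dataflow graph is physically placed in advance.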
The really complicated bit is converting the TensorFlow model into some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip? It's insanely complicated. If anything kills Cerebras, it will be the software.
https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...
Then you put your data next to the core that uses it. Simples.
(Optimal placement is tricky, but approximate techniques work fine.)
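One way to see why "approximate techniques work fine": placement onto a 2D core grid is roughly a quadratic-assignment problem (NP-hard in general), but even a dumb random-swap hill climb drives the traffic-weighted wire length down quickly. This is a hedged sketch with made-up grid size, operators, and traffic weights, not anyone's production placer.

```python
import random

random.seed(0)
GRID = 4                                  # hypothetical 4x4 core grid
ops = list(range(8))                      # 8 operators to place
edges = [(0, 1, 3), (1, 2, 3), (2, 3, 1),  # (src, dst, traffic weight)
         (3, 4, 2), (4, 5, 3), (5, 6, 1), (6, 7, 2)]

def cost(place):
    """Total traffic-weighted Manhattan distance between communicating ops."""
    total = 0
    for a, b, w in edges:
        (xa, ya), (xb, yb) = place[a], place[b]
        total += w * (abs(xa - xb) + abs(ya - yb))
    return total

# Start from a random assignment of ops to distinct grid cells.
cells = [(x, y) for x in range(GRID) for y in range(GRID)]
random.shuffle(cells)
place = {op: cells[op] for op in ops}
init = cost(place)

best = init
for _ in range(5000):                     # random-swap hill climbing
    a, b = random.sample(ops, 2)
    place[a], place[b] = place[b], place[a]
    c = cost(place)
    if c <= best:
        best = c                          # keep the improving (or neutral) swap
    else:
        place[a], place[b] = place[b], place[a]  # undo

print("initial cost:", init, "-> approx cost:", best)
```

Real placers layer smarter moves on top (simulated annealing, partitioning, analytic placement), but the point stands: you don't need the optimum, just something close enough that the NoC traffic fits.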
I work on something similar but less ambitious; trust me, it is crazy complicated.