* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.
* Model parallelism alone gives full performance; no need for data parallelism if you size the model to fit.
* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.
* Fully asynchronous/dataflow by default; only minimal synchronization is needed between forward and backward passes.
I genuinely don't know how you'd build a simpler system than this.
In particular, when the problem changes from optimally scheduling a single state machine to placing operations on a fixed routing grid (à la FPGA), it becomes radically different, and any looping control flow turns into an absolute nail-biter of an issue.
Having talked to someone at Cerebras, I also know that they don't just want to do inference with this, they want to accelerate training as well. That can involve much more complex control flow than you might think. Start reading about automatic differentiation and you will soon realize that it's complex enough to basically be its own subfield of compiler design. There have been multiple entire books written on the topic, and I can guarantee you there can be control-flow-driven optimizations in there (e.g., if x == 0, don't compute this large subgraph).
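To make that "skip the subgraph" point concrete, here's a toy tape-based reverse-mode AD sketch (my own illustration, nothing to do with Cerebras's actual stack): a data-dependent branch means the expensive subgraph is never recorded, so the backward pass never touches it either.

```python
def f(x, w):
    """Toy forward pass. If x == 0, the 'large subgraph' (here just x * w
    as a stand-in) is never executed, so no adjoint ops land on the tape."""
    tape = []
    if x == 0:
        y = 0.0                      # cheap path: constant, zero gradient
    else:
        y = x * w                    # stand-in for the expensive subgraph
        tape.append(("mul", x, w))
    return y, tape

def grad_w(tape):
    """Backward pass: replay the tape in reverse, accumulating dL/dw
    for seed dL/dy = 1. The skipped branch contributes nothing."""
    g = 0.0
    for op, x, w in reversed(tape):
        if op == "mul":
            g += x                   # d(x*w)/dw = x
    return g

y0, tape0 = f(0.0, 3.0)
print(y0, grad_w(tape0))   # 0.0 0.0 -- subgraph and its gradient both skipped
y1, tape1 = f(2.0, 3.0)
print(y1, grad_w(tape1))   # 6.0 2.0
```

The compiler-design headache is exactly this: the shape of the backward computation depends on runtime control flow in the forward pass, which is awkward on hardware where the dataflow graph is physically placed in advance.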
The really complicated bit is converting the TensorFlow model into some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip? It's insanely complicated. If anything kills Cerebras, it will be the software.
https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...
Then you put your data next to the core that uses it. Simples.
(Optimal placement is tricky, but approximate techniques work fine.)
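One way to see why "approximate techniques work fine": placement onto a 2D core grid is roughly a quadratic-assignment problem (NP-hard in general), but even a dumb random-swap hill climb drives the traffic-weighted wire length down quickly. This is a hedged sketch with made-up grid size, operators, and traffic weights, not anyone's production placer.

```python
import random

random.seed(0)
GRID = 4                                  # hypothetical 4x4 core grid
ops = list(range(8))                      # 8 operators to place
edges = [(0, 1, 3), (1, 2, 3), (2, 3, 1),  # (src, dst, traffic weight)
         (3, 4, 2), (4, 5, 3), (5, 6, 1), (6, 7, 2)]

def cost(place):
    """Total traffic-weighted Manhattan distance between communicating ops."""
    total = 0
    for a, b, w in edges:
        (xa, ya), (xb, yb) = place[a], place[b]
        total += w * (abs(xa - xb) + abs(ya - yb))
    return total

# Start from a random assignment of ops to distinct grid cells.
cells = [(x, y) for x in range(GRID) for y in range(GRID)]
random.shuffle(cells)
place = {op: cells[op] for op in ops}
init = cost(place)

best = init
for _ in range(5000):                     # random-swap hill climbing
    a, b = random.sample(ops, 2)
    place[a], place[b] = place[b], place[a]
    c = cost(place)
    if c <= best:
        best = c                          # keep the improving (or neutral) swap
    else:
        place[a], place[b] = place[b], place[a]  # undo

print("initial cost:", init, "-> approx cost:", best)
```

Real placers layer smarter moves on top (simulated annealing, partitioning, analytic placement), but the point stands: you don't need the optimum, just something close enough that the NoC traffic fits.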
I work on something similar but less ambitious; trust me, it is crazy complicated.