When you consider everything that diagram leaves out, it doesn't look
at all simple. Does that graph even include training? That will have to be pipelined too, and it will probably need recomputation because memory is so tight. And what about
inside the boxes? You can't cleanly split a matmul into pieces like that.
I work on something similar but less ambitious, and trust me, it is crazy complicated.
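
To make the matmul point concrete, here is a minimal sketch in plain Python. It is only an illustration, not the system in the diagram: the helper names `matmul`, `split_inner`, and `add` are made up, and it assumes the split runs along the shared inner dimension. Each piece then produces a full-size partial result, and none of them is correct until they are all summed.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_inner(a, b, parts):
    """Shard A by columns and B by rows along the shared dimension.

    Hypothetical helper; assumes the inner dimension divides evenly by `parts`.
    """
    step = len(b) // parts
    shards = []
    for p in range(parts):
        lo, hi = p * step, (p + 1) * step
        shards.append(([row[lo:hi] for row in a], b[lo:hi]))
    return shards


def add(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
b = [[1, 0],
     [0, 1],
     [2, 2],
     [3, 1]]

# Each "box" computes its own partial product from its shard.
partials = [matmul(a_shard, b_shard) for a_shard, b_shard in split_inner(a, b, 2)]

# No single partial equals the real answer; they have to be reduced together.
combined = partials[0]
for partial in partials[1:]:
    combined = add(combined, partial)

print(partials[0] == matmul(a, b))  # False
print(combined == matmul(a, b))     # True
```

That final sum is the part a box diagram tends to hide: if the pieces live on separate devices, it becomes communication between them. Splitting by output rows instead avoids the sum, but then every piece needs its own copy of all of `b`.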