In a typical fully connected hidden layer, each neuron needs the outputs of all the neurons in the previous layer, so all the data has to be in one place. Obviously you can distribute the actual calculations, which is what a GPU does, but distributing that over networked CPUs will be incredibly slow and will require the whole thing to be loaded into memory on every instance.
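To make that concrete, here is a minimal sketch (sizes are made up for illustration) of a fully connected layer as a dense matrix-vector product: every output neuron reads the entire input activation vector, which is why splitting it across networked machines means shipping the whole vector everywhere.

```python
import numpy as np

# Hypothetical sizes for illustration: a layer mapping 4096 inputs
# to 4096 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)  # weight matrix
x = rng.standard_normal(4096).astype(np.float32)          # previous layer's outputs

# Each row of W touches EVERY element of x, so the full activation
# vector must be present wherever this computation runs.
y = np.maximum(W @ x, 0.0)  # matvec followed by ReLU

print(y.shape)  # (4096,)
```

A GPU parallelizes exactly this matvec across thousands of cores that share fast memory; over a network, each of those row-dot-products would need the full `x` shipped to it first.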
My bet is on some kind of light-based or analog electric accelerator PCIe card being the next best thing for this sort of inference, since it should be able to compute multiple layers at once. FPGAs also work, but only for fixed weights.