It's basically not possible to do what you are trying to do in an async manner. With advancements in large-batch training, it might be possible to do some sort of synchronous P2P gradient averaging.
What about with some fairly frequent and periodic synchronization?
Is there potentially some balance where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion? I was thinking this algorithm might be 10x less energy-efficient but have the benefit of decentralization. Something along those lines.
I’m guessing current training algorithms already do something like this, but since more rapid synchronization always increases efficiency (the extreme case being that giant single-wafer CPU), OpenAI and others use systems with high interconnect bandwidth.
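To make the periodic-synchronization idea concrete, here is a minimal sketch of what I mean, sometimes called local SGD or periodic parameter averaging. Everything in it is an assumption for illustration: the function name `local_sgd`, the one-parameter model `y = w * x`, the shard sizes, and the hyperparameters are all made up, and the "workers" are simulated in one process rather than talking over a network. The loss is also convex, so averaging is harmless here, which says nothing about how it behaves on a real network's loss.

```python
import random


def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.1, seed=0):
    """Toy local SGD: fit y = w * x (true w = 3.0) with periodic averaging."""
    rng = random.Random(seed)
    w_true = 3.0

    # Each simulated worker holds its own private shard of noisy (x, y) pairs.
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(32):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, w_true * x + rng.gauss(0.0, 0.01)))
        shards.append(shard)

    w_global = 0.0
    for _ in range(rounds):
        local_weights = []
        for shard in shards:
            # Start from the last synchronized value, then drift independently.
            w = w_global
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)^2
                w -= lr * grad
            local_weights.append(w)
        # Synchronization point: workers exchange weights and average them.
        w_global = sum(local_weights) / num_workers
    return w_global


if __name__ == "__main__":
    print(local_sgd())
```

The knob I am asking about is `local_steps`: larger values mean less communication but more drift between workers before each average.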
> where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion
I think this probably depends on the terrain of your loss landscape. My intuition is that many are too spiky: if you take a step or two in each of your subsets and then average the results, you will end up on a steep hill rather than in a valley between your two points.
But this is an active area of research for sure.
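A toy illustration of the hill-between-two-valleys intuition, as a sketch only. The double-well loss `(w^2 - 1)^2`, the starting points, and the helper names `loss`, `grad`, and `descend` are all assumptions chosen to make the effect obvious. It also exaggerates the scenario: the two "workers" start from different points and run to convergence, standing in for workers whose data subsets pull them in different directions over a few steps.

```python
def loss(w):
    """Toy non-convex loss with minima at w = -1 and w = +1."""
    return (w * w - 1.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 4.0 * w * (w * w - 1.0)


def descend(w, steps=200, lr=0.05):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Two workers that end up in different valleys.
w_a = descend(-0.3)
w_b = descend(0.3)

# Naive parameter averaging lands midway between the two valleys.
w_avg = 0.5 * (w_a + w_b)

print("worker A:", w_a, "loss", loss(w_a))
print("worker B:", w_b, "loss", loss(w_b))
print("average: ", w_avg, "loss", loss(w_avg))
```

Each worker reaches a loss near zero, but the averaged point sits at w = 0, the top of the bump between the two wells, where the loss is 1. Whether real loss landscapes look like this at the scale of a few local steps is exactly the open question.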