All of this makes it a poor fit for a pool of heterogeneous machines connected over the internet. That kind of setup wants a large batch of mostly independent tasks, each with a high compute cost but relatively low bandwidth requirements.
You need enough memory on each machine to hold the unquantized model for training, then you stream the training data through it. The streaming is the part that is done in parallel: different slices of the training data are farmed out to each machine.
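To make the "farm out the data" part concrete, here is a toy sketch of data-parallel training in plain Python. Everything in it is made up for illustration (a one-parameter model, four pretend workers, the names `grad_mse` and `data_parallel_step`); it is not how any particular framework does it. The point is just that every worker keeps a full copy of the weights, computes a gradient on its own slice, and the only thing exchanged per step is the gradient average.

```python
def grad_mse(w, shard):
    """Gradient of mean squared error for the model y = w * x over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    # Each (pretend) worker holds a full copy of w and sees only its own shard.
    local_grads = [grad_mse(w, shard) for shard in shards]
    # The only communication per step: average one gradient per worker.
    # In a real cluster this averaging would be an all-reduce over the network.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad


# Toy data generated from y = 3x, split across 4 hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]  # equal-sized shards, 2 examples each

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)

print(w)  # should end up very close to the true slope used to generate the data
```

With equal-sized shards the averaged gradient is the same as the gradient over the whole dataset, so the workers collectively take the same step a single big machine would. In a real model the thing being averaged is as large as the model itself, which is where the bandwidth cost comes from.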
https://www.microsoft.com/en-us/research/blog/zero-deepspeed...
The communications overhead of doing this over the internet might be unworkable though.
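For a rough sense of scale, here is a back-of-envelope estimate of the time to sync gradients once with a ring all-reduce, where each worker sends about 2(N-1)/N times the gradient size per step. All the numbers are assumptions picked for illustration (7B parameters, 2-byte gradients, 8 workers, a 100 Mbit/s home uplink versus a 100 Gbit/s datacenter link), and it ignores latency, compression, and any overlap with compute. It models plain gradient averaging, so a scheme that also shards the model state would have different numbers.

```python
def ring_allreduce_seconds(n_params, bytes_per_value, n_workers, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce of the full gradient."""
    payload = n_params * bytes_per_value
    # Each worker sends (and receives) 2 * (N - 1) / N times the payload.
    traffic_per_worker = 2 * (n_workers - 1) / n_workers * payload
    return traffic_per_worker / link_bytes_per_s


# Assumed figures, not measurements.
N_PARAMS = 7e9          # hypothetical 7B-parameter model
BYTES_PER_VALUE = 2     # fp16/bf16 gradients
N_WORKERS = 8

home = ring_allreduce_seconds(N_PARAMS, BYTES_PER_VALUE, N_WORKERS, 100e6 / 8)
datacenter = ring_allreduce_seconds(N_PARAMS, BYTES_PER_VALUE, N_WORKERS, 100e9 / 8)

print(f"home uplink:     {home:.0f} s per gradient sync")
print(f"datacenter link: {datacenter:.2f} s per gradient sync")
```

Under these assumptions a single sync takes on the order of half an hour over the home link versus a couple of seconds in the datacenter, and that cost is paid on every optimizer step.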