In transformer models, a big chunk of the memory is parameters and optimizer states (vanilla SGD isn't used there, so an optimizer like Adam keeps extra per-parameter state). A memory optimization that removes the duplication of parameters and optimizer state across GPUs, or offloads them to the CPU entirely, makes sense.
In computer vision, a big chunk of the memory is held by the forward-pass activations, and the applicable memory optimization in that case would be binomial checkpointing.
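To make the activation-checkpointing idea concrete, here is a toy sketch using a simple sqrt(n) schedule (a simplification of the binomial schedule mentioned above, not anyone's production implementation). The layers are f(x) = x², whose backward pass needs the layer's input, so activations that weren't stored must be recomputed from the nearest checkpoint:

```python
import math

def layer(x):
    return x * x

def layer_grad(x, g):
    # d(x**2)/dx = 2x, chained with the upstream gradient g
    return 2.0 * x * g

def checkpointed_backward(x0, n_layers, grad_out):
    """Store only every k-th activation (k ~ sqrt(n)); recompute the rest."""
    k = max(1, math.isqrt(n_layers))
    ckpts = {0: x0}
    x = x0
    for i in range(1, n_layers + 1):
        x = layer(x)
        if i % k == 0:
            ckpts[i] = x
    stored = len(ckpts)
    # Backward: walk layers in reverse, regenerating each layer's input
    # from the nearest checkpoint at or before it.
    g = grad_out
    for i in range(n_layers, 0, -1):
        start = (i - 1) // k * k
        a = ckpts[start]
        for _ in range(start, i - 1):
            a = layer(a)          # recompute up to the input of layer i
        g = layer_grad(a, g)
    return g, stored

def naive_backward(x0, n_layers, grad_out):
    """Baseline: keep every activation alive for the backward pass."""
    acts = [x0]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    g = grad_out
    for i in range(n_layers, 0, -1):
        g = layer_grad(acts[i - 1], g)
    return g, len(acts)
```

Both paths produce identical gradients; the checkpointed one holds roughly 2*sqrt(n) activations instead of n, paying for it with an extra partial forward pass during backward.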
For anyone who still thinks the blog post has substance: saying that they partitioned the optimizer state, params, etc. to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of removing redundancy before. The problem is that distributed training is, in general, a weird balancing act between redundancy, network usage, compute, etc. The existing methods (model/pipeline/data parallelism, gradient checkpointing/accumulation, etc.) all have their pros and cons. Unless ZeRO-3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.
If someone could ELI5 how ZeRO actually works, that would be nice.
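A toy, single-process sketch of what the partitioning means for stage 1 (optimizer-state partitioning): gradients are reduce-scattered so each worker holds, and updates, only its own shard of the optimizer state, then the updated parameters are all-gathered back. The extra collective communication is the cost you pay. All function names here are made up for illustration; the real thing runs these as NCCL collectives across processes:

```python
from typing import List

def reduce_scatter(grads_per_worker: List[List[float]], n: int) -> List[List[float]]:
    """Average gradients across workers; worker r keeps only shard r."""
    size = len(grads_per_worker[0])
    shard = size // n
    avg = [sum(g[i] for g in grads_per_worker) / n for i in range(size)]
    return [avg[r * shard:(r + 1) * shard] for r in range(n)]

def all_gather(shards: List[List[float]]) -> List[float]:
    """Reassemble the full parameter vector from per-worker shards."""
    return [x for s in shards for x in s]

def zero1_step(params, grads_per_worker, momentum_shards, n, lr=0.1, beta=0.9):
    """One SGD-with-momentum step where each worker owns 1/n of the
    momentum state (the point of stage 1: no worker stores all of it)."""
    grad_shards = reduce_scatter(grads_per_worker, n)
    shard = len(params) // n
    new_shards = []
    for r in range(n):
        p_shard = params[r * shard:(r + 1) * shard]
        m_shard = momentum_shards[r]   # worker r's slice of optimizer state
        updated = []
        for j, (p, g) in enumerate(zip(p_shard, grad_shards[r])):
            m_shard[j] = beta * m_shard[j] + g
            updated.append(p - lr * m_shard[j])
        new_shards.append(updated)
    return all_gather(new_shards)
```

Stage 3 extends the same move to gradients and parameters themselves, which is why it needs parameters gathered on the fly during forward/backward, i.e. more communication per step.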
I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.
[1] https://arxiv.org/abs/2006.15704 [2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr... [3] https://arxiv.org/abs/2101.06840