So, essentially, most ML frameworks' expressivity is heavily constrained by what the framework knows how to "efficiently" differentiate, generally via automatic differentiation (AD). Our paper presents a technique for improving many ML frameworks' AD implementations for a really common class of operations ("broadcast" operations) in a way that benefits not only performance but programmability as well.
More specifically, right now, ML frameworks mainly support only one kind of AD (reverse-mode). There's another kind of AD (forward-mode) that is more efficient in certain cases (and less efficient in others). Furthermore, certain programmatic constructs (like broadcasting certain kinds of kernels) that are intractable to differentiate on a GPU in reverse-mode can become tractable if you utilize forward-mode instead.
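(To make the tradeoff concrete, here's a tiny illustration of the two modes using JAX's `jvp`/`vjp`; this is just my own example of forward vs. reverse mode, not anything from the paper.)

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x  # an elementwise function of a vector input

x = jnp.arange(4.0)

# Forward mode (jvp): push a tangent vector through f alongside the
# values, in a single pass. Cost scales with the number of input
# directions you need, and there's nothing to store for later.
y, tangents_out = jax.jvp(f, (x,), (jnp.ones_like(x),))

# Reverse mode (vjp): run f forward (storing intermediates), then pull
# a cotangent back through it. Cost scales with the number of outputs
# you seed, which is why it wins for scalar losses over many parameters.
y, vjp_fn = jax.vjp(f, x)
(x_cotangent,) = vjp_fn(jnp.ones_like(y))
```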
It's generally a hard problem to combine the two modes optimally. However, we present a technique that enables the implementer to easily interleave the two modes specifically for differentiating broadcast operations. Our technique also removes some barriers to important compiler-level optimizations (like operator fusion), and thus can improve performance.
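As a rough sketch of the flavor of the interleaving (my own JAX-flavored paraphrase, with made-up names like `broadcast_apply`; not the paper's actual implementation): while broadcasting a scalar kernel over an array, you can use forward mode to compute each element's partial derivative in the same elementwise pass that computes its value, cache those partials, and then the reverse pass over the entire broadcast collapses to a single elementwise multiply:

```python
import jax
import jax.numpy as jnp

def scalar_kernel(x):
    return jnp.sin(x) * x  # the scalar function being broadcast

@jax.custom_vjp
def broadcast_apply(x):
    return jax.vmap(scalar_kernel)(x)

def broadcast_fwd(x):
    # Forward mode *inside* the broadcast: a jvp with a unit tangent
    # yields each element's value and partial in one fused pass.
    y, partials = jax.jvp(jax.vmap(scalar_kernel), (x,), (jnp.ones_like(x),))
    return y, partials  # cache the partials as the residual

def broadcast_bwd(partials, dy):
    # Reverse mode *outside* the broadcast: the pullback is just an
    # elementwise multiply by the cached partials -- no per-element tape.
    return (dy * partials,)

broadcast_apply.defvjp(broadcast_fwd, broadcast_bwd)

x = jnp.linspace(0.0, 1.0, 5)
grad = jax.grad(lambda v: jnp.sum(broadcast_apply(v)))(x)  # == cos(x)*x + sin(x)
```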
> I have never heard the term 'broadcasting' in this context
If you've used numpy, you've likely used this kind of "broadcasting"; AFAIK it was numpy that popularized the use of the term "broadcasting" for this operation in Python (though I'm no Python historian, so take that with a grain of salt).
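For anyone who hasn't run into it, a quick numpy example: an operation between a `(3, 1)` array and a `(1, 4)` array gets "broadcast" out to shape `(3, 4)`, as if each operand had been tiled to match:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)

# numpy "broadcasts" the size-1 dimensions so that `+` applies
# elementwise, as if both operands had shape (3, 4):
table = col + row
print(table.shape)   # (3, 4)
```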