I've always wondered what proportion of modern real-time video effects rely on ML vs. classical image processing; this not only answers that question, but also provides details down to the level of model architecture and the final latency and IoU benchmarks.
Of course I'd be more interested to read how Zoom manages to do even better, but I'm not holding my breath for them to publish those details.