I suspect that, like self-driving, the last 10%, 1%, 0.1% will be both functionally essential and exponentially difficult.
Video calls work great (well, once we've sorted out the eye contact issue - now there's a real problem that really needs solving[1]). Even with all the ML in the world, avatars will be just a pale reflection of the real thing.
[1] You need a screen that is also a composite camera array, so that software can track the eyes on the incoming video feed and place the camera for the outgoing feed at that (moving) location. Sort of like a phased array for light. Thus when you look at someone's eyes, they see you looking directly down the camera.
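
To make [1] a bit more concrete, here's a rough Python sketch of just the "place the camera at that location" step. It assumes a regular grid of cameras behind the screen and that some eye tracker (not shown) has already given you the remote person's eye position on your screen in normalized 0..1 coordinates. The function name, the grid layout and the simple bilinear blend are all my own assumptions for illustration; real view synthesis from a camera array would need a lot more than weighted blending.

```python
from typing import Dict, Tuple


def camera_weights(eye_x: float, eye_y: float,
                   cols: int, rows: int) -> Dict[Tuple[int, int], float]:
    """Bilinear blend weights for the grid cameras around a gaze target.

    eye_x, eye_y: where the remote person's eyes appear on the local
    screen, normalized to 0..1 (assumed to come from an eye tracker).
    cols, rows: size of the hypothetical camera grid behind the screen.
    Returns {(col, row): weight}; the weights sum to 1.
    """
    if cols < 2 or rows < 2:
        raise ValueError("need at least a 2x2 camera grid")

    # Clamp so a target slightly off-screen still maps to edge cameras.
    eye_x = min(max(eye_x, 0.0), 1.0)
    eye_y = min(max(eye_y, 0.0), 1.0)

    # Position in grid units; cameras sit at integer coordinates.
    gx = eye_x * (cols - 1)
    gy = eye_y * (rows - 1)

    # Top-left camera of the cell containing the target.
    c0 = min(int(gx), cols - 2)
    r0 = min(int(gy), rows - 2)
    fx = gx - c0
    fy = gy - r0

    weights = {
        (c0, r0): (1 - fx) * (1 - fy),
        (c0 + 1, r0): fx * (1 - fy),
        (c0, r0 + 1): (1 - fx) * fy,
        (c0 + 1, r0 + 1): fx * fy,
    }
    # Drop cameras that contribute nothing.
    return {cam: w for cam, w in weights.items() if w > 0.0}
```

You'd call this every frame as the tracked eye position moves, so the virtual camera follows the face you're looking at. When the eyes land exactly on a camera, only that camera is used; otherwise the outgoing view is a blend of the neighbours.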