If robot manipulators could suddenly grasp any object, operate any knob or switch, tie knots, and manipulate cloth, all with the same hardware and on first encounter, that would be quite a feat.
But even then there's still task planning, which is a very different topic. And ... and ... So much still to develop before robots are generally useful.
Just getting a robot to navigate itself using vision means building a complex system with a lot of pieces (beyond the most basic demo, anyway). You need separate neural nets doing all kinds of different tasks, plus a massive training pipeline behind them. You can see how much work Tesla has had to do to get a robot to safely drive on public roads. [2]
From where I'm sitting, I think we are making real inroads toward an "ImageNet moment" for robotics. (I should note that I am a robotics engineer, but I mostly work on driver-level software and hardware, not AI. I follow the research from the outside.)
It seems like a combination of transformers, scale, and cross-domain reasoning like CLIP [3] could begin to yield a system that mimics humans. But as good as transformers are, we still haven't figured out how to get them to learn for themselves, and that's probably a hard requirement for being really useful in the real world. There is good work in RL happening on that front, though.
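The core idea behind CLIP [3] is simple to sketch: embed images and text into a shared vector space and rank captions by cosine similarity. Here's a toy version with random stand-in encoders (an assumption on my part; the real encoders are large trained networks, and `encode_image`/`encode_text` are just placeholder names):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder encoders: any model mapping its modality into a shared
# d-dimensional space would slot in here. These just emit random vectors.
def encode_image(image, d=64):
    return rng.standard_normal(d)

def encode_text(text, d=64):
    return rng.standard_normal(d)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot classification, CLIP-style: score one image against candidate
# captions and pick the best match.
image_vec = encode_image("frame_from_robot_camera")
captions = ["a photo of a doorknob", "a photo of a light switch"]
scores = {c: cosine(image_vec, encode_text(c)) for c in captions}
best = max(scores, key=scores.get)
```

The point for robotics is that the same scoring scheme works for labels the model never saw at training time, which is exactly the kind of cross-domain generalization a general-purpose robot would need.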
Gosh, yeah, this is gonna take decades lol. Maybe we will have a spark that unites all this into one efficient system. Improving transformer efficiency plus big jumps in scale will probably get interesting things solved, but all the groundwork is a real slog.
[1] https://reboot.love/t/new-cameras-on-rover/277
RL, which I think this particular story is about, is an odd duck. I have papers in this area, and I personally have mixed feelings. I am a very applications/solutions-oriented researcher, and I am a bit skeptical about how pragmatic the state of the field is (e.g., reward-function specification). The point made by the OpenAI founder, that RL is not well suited to taking advantage of large datasets, is pretty valid.
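To make the reward-specification complaint concrete, consider a toy 2D reaching task. A sparse reward is trivial to state but gives the agent almost no gradient to learn from; a shaped reward is learnable but its exact form becomes an engineering choice that is easy to get subtly wrong. A hypothetical sketch (both reward forms are my own illustration, not from any particular paper):

```python
import math

GOAL = (1.0, 1.0)  # target position for the end effector

def sparse_reward(pos, goal=GOAL, tol=0.05):
    # Easy to specify correctly, but almost no learning signal:
    # the agent sees 0.0 everywhere until it stumbles on the goal.
    return 1.0 if math.dist(pos, goal) < tol else 0.0

def shaped_reward(pos, goal=GOAL):
    # Denser signal (closer is better), but now the scale and shape
    # are design decisions; a bad choice can reward unintended behavior.
    return -math.dist(pos, goal)
```

In practice people iterate on the shaped version until the learned behavior stops exploiting its loopholes, which is exactly the unglamorous part of the field I'm skeptical about.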
Finally, you raise interesting points about running multiple complex DNNs. Have you tried hooking things up to ROS and using that as scaffolding? (I'm not a robotics guy, I just dabble as a hobby, so I'm curious what the solutions are.) Google has something called MediaPipe, which is intriguing but maybe not what you need. I've seen some NVIDIA frameworks, but they basically do pub-sub in a suboptimal way. Curious what your thoughts are on what makes the existing solutions insufficient (I suspect they are too!)
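For what it's worth, the scaffolding that ROS (and those NVIDIA frameworks) provide for chaining DNNs is essentially topic-based pub-sub between nodes. A stripped-down, single-process sketch of the pattern (the names here are mine, not any real ROS API):

```python
from collections import defaultdict

class Bus:
    """Toy in-process message bus, ROS-topic style."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a callback to run on every message for this topic.
        self._subs[topic].append(callback)

    def publish(self, topic, msg):
        # Fan the message out to all subscribers of the topic.
        for cb in self._subs[topic]:
            cb(msg)

bus = Bus()
detections = []

# A "perception node" publishes results; a "planner node" subscribes.
bus.subscribe("/camera/detections", detections.append)
bus.publish("/camera/detections", {"label": "doorknob", "conf": 0.9})
```

Real middleware adds the parts this sketch omits (serialization, transport across processes and machines, QoS), and that's usually where the suboptimal pub-sub complaints come from.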