IMO this will require not just much more expansive multi-modal training, but also novel architecture, specifically, recurrent approaches; plus a well-known set of capabilities most systems don't currently have, e.g. the integration of short-term memory (context window if you like) into long-term "memory", either episodic or otherwise.
But these are as we say mere matters of engineering.