Video data is still largely untapped and likely to unlock a step-function increase in available training data. Current image-language models are trained mostly on {image, caption} pairs with a bit of extra fine-tuning.
I'm not sure I agree. Text is full of compressed information, but it lacks the visual cues we all use to navigate and understand our world. Video data also has a temporal component, which text is really bad at capturing.