It's a fact of neural networks that to train them supervised, you need the training data in the expected input format (a vector of the n thousand preceding tokens, for an LLM) paired with the expected output (the next token, for an LLM). "Training them on video" would mean converting the video into a format we can train the LLM on, then training the LLM on that data.
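A minimal sketch of what that input/output pairing looks like (toy word-level tokens for illustration; a real LLM would use subword token IDs and a much larger context window):

```python
def make_training_pairs(tokens, context_len):
    """Slide a fixed-size context window over the token stream,
    pairing each window of preceding tokens with the next token."""
    pairs = []
    for i in range(context_len, len(tokens)):
        context = tokens[i - context_len:i]   # the preceding tokens (input)
        target = tokens[i]                    # the next token (label)
        pairs.append((context, target))
    return pairs

# Toy example: words stand in for token IDs here.
tokens = "the cat sat on the mat".split()
for context, target in make_training_pairs(tokens, context_len=3):
    print(context, "->", target)
```

Whatever format video gets converted into (transcripts, frame captions, etc.), it ultimately has to be flattened into this kind of token stream before it can serve as supervised training data.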
This would probably be at most a 1 OOM increase in training data, and that's assuming the video transcripts aren't already part of GPT's training set.