So, I suppose the next step would be to parse a bunch of screenplays from different formats into a single readable format, then train an image model on the frames of the movies whose screenplays we trained the text model on, to get a cross-reference of what is written down vs. what is displayed visually. We could also break down the visual shots by camera movement (Steadicam, dolly move, etc.) and maybe identify key props in the image model (sounds expensive) and compare them to the key props in the script. I don't know, I'm spitballing now, but a multi-modal Hollywood film producer would be kind of fun. Really, though, this is just starting as a way to standardize the script in a granular form, and as a way to code since I'm not out on set.
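To make "standardize the script in a granular form" concrete, here's a minimal sketch of what a parsed scene unit could look like. All of the class and field names here are assumptions, not an existing schema:

```python
from dataclasses import dataclass, field

@dataclass
class SceneUnit:
    """Hypothetical standardized scene record, parsed from any screenplay format."""
    scene_number: int
    slugline: str                                       # e.g. "INT. KITCHEN - NIGHT"
    characters: list[str] = field(default_factory=list)
    action_lines: list[str] = field(default_factory=list)
    props_mentioned: list[str] = field(default_factory=list)  # for comparing against image-model props
    page_length: float = 0.0                            # screenplay pages, in 1/8-page increments

scene = SceneUnit(
    scene_number=12,
    slugline="INT. KITCHEN - NIGHT",
    characters=["ALICE", "BOB"],
    props_mentioned=["knife", "kettle"],
    page_length=1.5,
)
print(scene.slugline)
```

Whatever the real fields end up being, the point is that every source format (Final Draft, Fountain, PDF, etc.) would parse down into the same record type.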
What would be a typical unit? A keyframe? A bucket of frames?
Because the elements are both temporal and visual, I think some form of "object in time" relation needs to be encoded in the description.
For example, when we're shooting there is a rough formula for how long a day we need to get those scenes: usually 1 hour per 1-page scene, plus an extra 30 minutes for each character in the scene. But that doesn't translate to the final product, since that information tells us nothing about how long or important a scene should be in the final cut.
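That rule of thumb is simple enough to sketch as code (the function name is mine, just illustrating the arithmetic from the text):

```python
def estimate_shoot_hours(page_count: float, num_characters: int) -> float:
    """Rough scheduling rule from the text:
    1 hour per page, plus 30 minutes per character in the scene."""
    return page_count * 1.0 + num_characters * 0.5

# A 2-page scene with 3 characters: 2 + 1.5 = 3.5 hours
print(estimate_shoot_hours(2, 3))  # 3.5
```

Which is exactly the point: this estimate lives entirely in production-land and carries no signal about the scene's weight in the finished film.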
But it's also possible I'm getting ahead of myself here, and maybe there's a separate object that ties together the scene, production, and final-product objects, instead of jamming all of it into this one object.
Open to any suggestions you may have.