The idea would be to make an educated guess at where each word occurs in the video, going off the subtitle text and timing data parsed by pysrt, and build a dict mapping each word to the times at which it occurs. You could then use MoviePy to stitch together a video version of the generated dialogue by looking up the appropriate clip for each word.
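Here is a rough sketch of the guessing step, just to make the idea concrete. It assumes each subtitle has already been reduced to a `(start_seconds, end_seconds, text)` tuple, and it splits each subtitle's duration across its words in proportion to word length. That heuristic, and the names `estimate_word_times` and `plan_clips`, are my own placeholders rather than anything pysrt or MoviePy provides.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9']+")


def estimate_word_times(cues):
    """Guess when each word is spoken.

    cues: iterable of (start_seconds, end_seconds, text) tuples (assumed format).
    Returns a dict mapping each lowercased word to a list of (start, end) guesses.
    """
    index = defaultdict(list)
    for start, end, text in cues:
        words = WORD_RE.findall(text.lower())
        if not words or end <= start:
            continue  # nothing to place, or a malformed cue
        total_chars = sum(len(w) for w in words)
        t = start
        for w in words:
            # Assumption: longer words take proportionally longer to say.
            span = (end - start) * len(w) / total_chars
            index[w].append((t, t + span))
            t += span
    return dict(index)


def plan_clips(sentence, index):
    """Look up a (start, end) guess for each word of the generated sentence.

    Words never seen in the subtitles are skipped; the first occurrence is
    used for the rest (an arbitrary choice for this sketch).
    """
    plan = []
    for w in WORD_RE.findall(sentence.lower()):
        if w in index:
            plan.append(index[w][0])
    return plan


if __name__ == "__main__":
    cues = [(0.0, 2.0, "Hello there"), (2.0, 4.0, "There, you go!")]
    index = estimate_word_times(cues)
    print(plan_clips("You there", index))
```

With pysrt, the tuples could be built from each subtitle's `start.ordinal / 1000`, `end.ordinal / 1000` and `text`. Each `(start, end)` pair in the plan would then become a subclip of the source video (`subclip` in MoviePy 1.x), joined with `concatenate_videoclips`. Expect the cuts to be rough, since the timings are only guesses.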