Any in-frame motion probably lets you align the recordings to the nearest frame after the fact. This is existing technology, and it gives you a timestamp-to-frame alignment.
If you are reconstructing sound, you can then fuzz the time alignments around that frame-level estimate and keep the offset that gives the maximum signal for the maximum time (a misaligned pair does not correlate and damps to random noise quickly). This lets you reconstruct the time alignments pairwise.
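To make the "fuzz the alignment and keep the strongest signal" idea concrete, here is a rough sketch using a brute-force cross-correlation search over candidate offsets. It is only an illustration under assumptions of mine: the function names `best_lag` and `make_recordings` are made up, the two "recordings" are synthetic noisy copies of one random source, and the offset is a whole number of samples.

```python
import random


def best_lag(a, b, max_lag):
    """Try every shift in [-max_lag, max_lag] and return (lag, score).

    The score is the mean product of overlapping samples, so a lag of k
    means a[i] lines up with b[i + k]. Wrong lags average out near zero.
    """
    best_shift, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for i, sample in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += sample * b[j]
                count += 1
        if count and total / count > best_score:
            best_shift, best_score = lag, total / count
    return best_shift, best_score


def make_recordings(true_lag, length=1000, seed=0):
    """Build two noisy views of one synthetic source, offset by true_lag samples."""
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(length + true_lag)]
    # Recording a starts true_lag samples later than recording b.
    a = [s + rng.gauss(0.0, 0.5) for s in source[true_lag:]]
    b = [s + rng.gauss(0.0, 0.5) for s in source[:length]]
    return a, b


if __name__ == "__main__":
    a, b = make_recordings(true_lag=37)
    lag, score = best_lag(a, b, max_lag=100)
    print("estimated lag:", lag, "score:", round(score, 3))
```

In practice the frame-level alignment would set `max_lag` (roughly one frame's worth of audio samples), and real recordings would likely need an FFT-based correlation and sub-sample refinement rather than this quadratic loop.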
At that point, you put them all together and run your detailed analysis.
Now, I didn't say this was EASY. :) Or cheap. Or real-time.
Just that it is possible.