I've seen a deer on a road maybe once. I've seen a rabbit on a road zero times. But I know what to do if I see one.
Is that because the "video" of my perception has many "frames"? Even if that's true at some level, I think it's massively missing the point. Yeah, so I saw that one deer from a lot of angles. But current AI training is the equivalent of training on every deer that has ever been caught on camera in the history of the human species.
Somehow I'm still dramatically better at generalization than the AI. Surely that's a difference in the algorithm.