The surprising thing is inter-modality shared variation. I wouldn't have bet against it but I also wouldn't have guessed it.
I would like to see model interpretability work into whether these subspace vectors can be interpreted as low level or high level abstractions. Are they picking up low level "edge detectors" that are somehow invariant to modality (if so, why?) or are they picking up higher level concepts like distance vs. closeness?