I'd sooner expect them to use this to 'feed' a larger neural path tracing engine, where you can get away with 1 sample every x frames. Those already do a pretty good job of generating great-looking images from what looks like noise.
I don't think the conventional similarity matrix in the paper is all that important to them.