There's a real difference between a lossy approximation as done by video compression and the "just a guess" done by DLSS frame generation. A video encoder has the real frame as its target: when it compresses a frame by referencing other frames via motion vectors, it can measure its own prediction error and transmit a residual to correct it. DLSS fundamentally has less information when generating new frames, because the frame it's predicting doesn't exist yet, and that's why it introduces much worse motion artifacts.
It would be VERY interesting to have actual quantitative data on how many possible source frames map to a specific P- or B-frame, versus how many possible raster frames map to a given DLSS-generated frame. The lower this ratio, the more "accurate" the prediction is.
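A toy sketch of the asymmetry (not real codec or DLSS behavior, just an illustration with a square translating vertically across three synthetic frames): the "encoder" side predicts the new frame via a motion vector *and* corrects itself with a residual because it can compare against the real frame, while the "frame generation" side can only extrapolate the motion it saw between the two previous frames. When the motion accelerates, only the encoder catches the error.

```python
import numpy as np

def make_frame(shift):
    # a "scene": a bright square on a dark background, shifted down by `shift`
    f = np.zeros((32, 32))
    f[8 + shift : 16 + shift, 8:16] = 1.0
    return f

# three consecutive frames; the motion accelerates between f1 and f2
f0, f1, f2 = make_frame(0), make_frame(2), make_frame(6)
shifts = range(-8, 9)  # tiny 1-D motion search range

# --- Encoder-side P-frame: predict f2 from f1 while holding the real f2 ---
best = min(shifts, key=lambda s: np.abs(np.roll(f1, s, axis=0) - f2).sum())
pred_p = np.roll(f1, best, axis=0)
residual = f2 - pred_p                          # encoder measures its own error...
recon_p = pred_p + np.round(residual * 4) / 4   # ...and transmits it (coarsely quantized)
err_p = np.abs(recon_p - f2).mean()

# --- Frame generation: guess f2 from f0 and f1 only, no target to check against ---
guess = min(shifts, key=lambda s: np.abs(np.roll(f0, s, axis=0) - f1).sum())
pred_gen = np.roll(f1, guess, axis=0)           # extrapolate the f0->f1 motion forward
err_gen = np.abs(pred_gen - f2).mean()

print(f"P-frame reconstruction error: {err_p:.4f}")
print(f"frame-generation guess error: {err_gen:.4f}")
```

In this toy, the encoder's error is bounded by its quantizer no matter what the scene does, while the extrapolated guess is wrong by the full amount of the unobserved acceleration, which is the information gap described above.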