If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you have few options for responding efficiently to the new thing, and most likely need to re-read the 1M things.
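A minimal sketch of that contrast, assuming the question being asked is "how many times have I seen this item?". The names `StatefulCounter` and `count_stateless` are made up for illustration and are not from any particular system:

```rust
use std::collections::HashMap;

/// Stateful version: remembers a count for every item seen so far.
struct StatefulCounter {
    counts: HashMap<String, u64>,
}

impl StatefulCounter {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    /// Handles one new item with a single hash map update
    /// and returns that item's updated count.
    fn observe(&mut self, item: &str) -> u64 {
        let count = self.counts.entry(item.to_string()).or_insert(0);
        *count += 1;
        *count
    }
}

/// Stateless version: with nothing remembered, answering the same
/// question means scanning the entire history again.
fn count_stateless(history: &[String], item: &str) -> u64 {
    history.iter().filter(|seen| seen.as_str() == item).count() as u64
}
```

With 1M items already seen, `observe` touches one map entry, while `count_stateless` walks all 1M items plus the new one.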
That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on timely dataflow) are effectively deterministic, so in principle you can rebuild the state by replaying the same inputs.
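To make the recovery point concrete, here is a rough sketch of why determinism helps, assuming the inputs are kept in a durable log of `(key, delta)` pairs. This is not differential dataflow's API; `apply` and `recover` are hypothetical names for illustration only:

```rust
use std::collections::HashMap;

/// Applies one logged input to the state. This is deterministic:
/// the same starting state and the same input always give the same result.
fn apply(state: &mut HashMap<String, i64>, key: &str, delta: i64) {
    *state.entry(key.to_string()).or_insert(0) += delta;
}

/// Rebuilds the state after a failure by replaying the input log
/// from the beginning, in order.
fn recover(log: &[(String, i64)]) -> HashMap<String, i64> {
    let mut state = HashMap::new();
    for (key, delta) in log {
        apply(&mut state, key, *delta);
    }
    state
}
```

Because `apply` has no hidden inputs (no clocks, no randomness), the map returned by `recover` matches whatever the live state was before the failure. A real system would usually checkpoint as well, so it does not have to replay from the very start.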
Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years.