frankmcsherry 5y ago

It just comes down to something as simple as: "if I have shown you 1M different things, and now show you one more thing, what do you have to do to tell me whether that one thing is new or not?"

If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you don't have a lot of options to efficiently respond to the new thing, and most likely need to re-read the 1M things.
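To make the contrast concrete, here is a minimal sketch in Python. The names `is_new_stateless` and `SeenTracker` are made up for illustration and are not part of timely dataflow or Spark. The stateless version has to re-scan everything it was shown, while the stateful version answers from a hash set.

```python
def is_new_stateless(history, item):
    """No retained state: re-read everything shown so far, O(n) per query."""
    for seen in history:
        if seen == item:
            return False
    return True


class SeenTracker:
    """Stateful: keep a hash set of what has been seen, O(1) expected per query."""

    def __init__(self):
        self._seen = set()

    def observe(self, item):
        """Record the item and report whether it was new."""
        if item in self._seen:
            return False
        self._seen.add(item)
        return True


# Show the tracker 1M different things, then ask about one more.
history = list(range(1_000_000))
tracker = SeenTracker()
for x in history:
    tracker.observe(x)

print(tracker.observe(1_000_000))  # a thing not shown before
print(tracker.observe(42))         # a thing already shown
```

Both approaches give the same answers; the difference is only in how much work each new item costs.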

That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on timely dataflow, TD) are effectively deterministic, so the state can be rebuilt by replaying the same inputs.
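As a rough illustration of why determinism helps with recovery, the sketch below rebuilds state by replaying a durable input log. The `apply_update` and `replay` helpers and the log format are assumptions for this example, not how timely or differential dataflow actually persist or recover state.

```python
def apply_update(state, update):
    """Deterministic transition: the same state and update always give the same result."""
    key, delta = update
    state[key] = state.get(key, 0) + delta
    return state


def replay(log):
    """Rebuild state from scratch by re-applying every logged update in order."""
    state = {}
    for update in log:
        apply_update(state, update)
    return state


# Hypothetical durable input log of (key, delta) updates.
log = [("a", 1), ("b", 1), ("a", 1), ("b", -1)]

live_state = replay(log)   # state held by the running computation
recovered = replay(log)    # state rebuilt after a simulated failure

print(recovered == live_state)
```

Because nothing in the transition depends on timing or randomness, the recovered state matches the state that was lost.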

Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years.

shay_ker 5y ago

I see. To my small brain it sounds like TD can intelligently memoize or cache the outputs of each "step" so that it only recalculates when it needs to as the inputs change.

I think Spark does that sometimes these days, but I don't know much about the specifics of how and when Spark does it.

Does TD have to keep _everything_ in memory, or can it be strategic in what it keeps and what it evicts?

frankmcsherry (OP) 5y ago

TD lets you write whatever logic you want (it is fairly unopinionated on your logic and state).

Differential dataflow plugs in certain logic there, and it does indeed maintain a synopsis of what data have gone past, sufficient to respond to future updates but not necessarily the entirety of data that it has seen.
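A toy example of what such a synopsis can look like, written in Python. This is only a sketch of the idea for a "distinct" operator and not differential dataflow's actual data structures; the class name `IncrementalDistinct` and the `(record, diff)` update format are assumptions. It keeps one count per live record rather than every update it has seen, and that is enough to decide how the output changes when a new update arrives.

```python
class IncrementalDistinct:
    """Maintains the set of distinct records under insertions and deletions."""

    def __init__(self):
        # Synopsis: one count per live record, not the full update history.
        self._counts = {}

    def update(self, record, diff):
        """Apply an input change (+1 insert, -1 delete) and return output changes."""
        before = self._counts.get(record, 0)
        after = before + diff
        if after < 0:
            raise ValueError("cannot delete a record that is not present")
        if after == 0:
            self._counts.pop(record, None)  # evict records that are gone
        else:
            self._counts[record] = after
        if before == 0 and after > 0:
            return [(record, +1)]   # record enters the distinct output
        if before > 0 and after == 0:
            return [(record, -1)]   # record leaves the distinct output
        return []                   # output unchanged, nothing to recompute


d = IncrementalDistinct()
print(d.update("x", +1))  # first copy of "x"
print(d.update("x", +1))  # second copy, output unchanged
print(d.update("x", -1))  # one copy remains, output unchanged
print(d.update("x", -1))  # last copy removed
```

Records whose count drops to zero are dropped from the synopsis, so memory tracks the live data rather than everything that has gone past.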

It would be tricky to implement DD over classic Spark, as DD relies on these synopses for its performance. Recent papers propose changes to Spark that would let it pull in immutable LSM layers without reading them (e.g. by just mmapping them), which might improve things, but until that happens there will be a gap.

shay_ker 5y ago

Gotcha. Thanks for answering all my q's!
