The semantics look very similar indeed. Does anyone have a comparison between Onyx and Spark?
A few key differences:
Onyx leans heavily on data structures to define the structure of computation: the data flow (the Onyx workflow) and the parameterization (the Onyx catalog) are both expressed as Clojure maps and vectors. In comparison, Flambo and Spark define the structure of computation via functions over collections. One way in which Onyx's approach is powerful is that workflows and catalogs are plain data, so it becomes trivial to manipulate them at runtime before submitting a job, for example to add tasks or task options.
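To make the "computation as data" point concrete, here is a rough sketch in Python rather than Clojure. It is only an analogue: the key names ("name", "type") and the insert_task helper are made up for illustration and are not the Onyx API. The workflow is a list of edges, the catalog is a list of maps, and splicing in a task is ordinary list manipulation.

```python
# Hypothetical analogue of an Onyx-style job description.
# The workflow is a list of (source, destination) edges; the catalog is a
# list of maps describing each task. None of these names come from Onyx.
workflow = [("in", "inc"), ("inc", "out")]
catalog = [
    {"name": "in", "type": "input"},
    {"name": "inc", "type": "function"},
    {"name": "out", "type": "output"},
]


def insert_task(workflow, catalog, after, task):
    """Return a new (workflow, catalog) with `task` spliced in after `after`.

    Assumes `after` has a single outgoing edge. The inputs are not mutated.
    """
    new_workflow = []
    for src, dst in workflow:
        if src == after:
            # Reroute the edge through the new task.
            new_workflow.append((src, task["name"]))
            new_workflow.append((task["name"], dst))
        else:
            new_workflow.append((src, dst))
    return new_workflow, catalog + [task]


# Add a logging step just before the job would be submitted.
new_workflow, new_catalog = insert_task(
    workflow, catalog, "inc", {"name": "log", "type": "function"}
)
```

Because the job is just data, the same kind of transformation could be applied by any code that runs before submission, without touching the task functions themselves.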
Onyx also implements batch processing on top of streaming operations, whereas Spark appears to do the opposite, implementing streaming on top of batch operations. There are likely to be trade-offs between these approaches.
Spark is also a lot faster, though this isn't necessarily intrinsic to the approaches.
These are all the things I wrote down that I wanted before I wrote the first line of code.
http://michaeldrogalis.tumblr.com/post/98143185776/onyx-dist...
- Onyx's log doesn't grow particularly large because it's only used for coordination, not for messaging.
- Because the log isn't huge, and can be GC'ed, consumers don't experience high volumes of messages.
- ZooKeeper offers sequential node creation - making it a really good fit for what the log needs to do.
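A toy model of the three points above, as I understand them. This is an in-memory stand-in, not ZooKeeper and not Onyx's actual log: the SequentialLog class and its method names are invented for illustration. Each appended entry gets the next sequence number (roughly what a sequential znode gives you), consumers read forward from a position, and old entries can be dropped once nobody needs them.

```python
class SequentialLog:
    """In-memory sketch of a small, append-only coordination log."""

    def __init__(self):
        self._next_seq = 0   # next sequence number to hand out
        self._entries = {}   # sequence number -> entry

    def append(self, entry):
        """Store `entry` under the next sequence number and return that number."""
        seq = self._next_seq
        self._entries[seq] = entry
        self._next_seq += 1
        return seq

    def read_from(self, seq):
        """Return (sequence, entry) pairs with sequence >= seq, in order."""
        return [(s, self._entries[s]) for s in sorted(self._entries) if s >= seq]

    def gc(self, up_to):
        """Drop entries with sequence < up_to; numbering is not reused."""
        for s in [s for s in self._entries if s < up_to]:
            del self._entries[s]

    def __len__(self):
        return len(self._entries)


log = SequentialLog()
log.append("peer-1 joined")
log.append("peer-2 joined")
log.append("job submitted")

pending = log.read_from(1)   # a consumer that has already applied entry 0
log.gc(2)                    # every consumer is past entries 0 and 1
next_seq = log.append("peer-1 left")
```

The log only carries coordination events like these, never the data segments themselves, which is why it stays small.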
From a brief examination, Tesser looks a lot simpler (probably because it encodes most of the folding using various monoids). Does Onyx have a similar abstraction model that I missed?
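For anyone unfamiliar with what folding with a monoid means here, a minimal sketch in Python (not Tesser's API; fold_chunks is a made-up name). A monoid is an identity value plus an associative combine function, which is what lets each chunk be reduced independently and the partial results merged afterwards.

```python
from functools import reduce
from operator import add


def fold_chunks(identity, combine, chunks):
    """Reduce each chunk on its own, then combine the partial results.

    Correct for any grouping into chunks only if `combine` is associative
    and `identity` is its identity element. The per-chunk reductions are
    independent, so they could run in parallel; here they run sequentially.
    """
    partials = [reduce(combine, chunk, identity) for chunk in chunks]
    return reduce(combine, partials, identity)


# Sum as a monoid: identity 0, combine +.
total = fold_chunks(0, add, [[1, 2], [3], []])

# Max over non-negative numbers as a monoid: identity 0, combine max.
largest = fold_chunks(0, max, [[4, 1], [9], [2, 7]])
```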
Both libraries are awesome.