Onyx: fault tolerant data processing for Clojure

(github.com)

103 points · coding4all · 11y ago · 21 comments

21 comments

afandian 11y ago

This looks very interesting. I'm doing some log file processing in Apache Spark in Clojure. Spark is written in Scala, but has a Java API, which is wrapped by Flambo. It looks and feels entirely Clojure.

The semantics look very similar indeed. Does anyone have a comparison between Onyx and Spark?

lbradstreet 11y ago

I've used Onyx, but I haven't used Spark, so take this with a grain of salt.

A few key differences:

Onyx aggressively uses data structures to define the structure of computation, defining the data flow (Onyx workflow) and parameterization (Onyx catalog) of the computation via Clojure maps and vectors. In comparison, Flambo and Spark define the structure of computation via functions over collections. One way in which Onyx's approach is powerful is that it becomes trivial to manipulate workflows or catalogs at runtime before submitting jobs, allowing you to add additional tasks, task options, etc.

Onyx also implements batching on top of streaming operations, whereas Spark appears to do the opposite, building streaming on top of batch operations. There are likely to be trade-offs between these approaches.

Spark is also a lot faster, though this isn't necessarily intrinsic to the approaches.
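To make the data-driven point above concrete, here is a minimal sketch in plain Clojure. It assumes a workflow is a vector of `[from to]` edges and a catalog is a vector of maps keyed by `:onyx/name`; the task names, the `:my.app/*` function keywords, and the `insert-task` helper are hypothetical, and the exact catalog keys may differ between Onyx versions, so check the docs for the release you use.

```clojure
;; Illustrative job description as plain data (assumed shape, not copied
;; from the Onyx docs). Task names and :my.app/* keywords are hypothetical.
(def workflow
  [[:read-lines :parse-line]
   [:parse-line :write-results]])

(def catalog
  [{:onyx/name :read-lines    :onyx/type :input    :onyx/batch-size 100}
   {:onyx/name :parse-line    :onyx/type :function :onyx/batch-size 100
    :onyx/fn   :my.app/parse-line}
   {:onyx/name :write-results :onyx/type :output   :onyx/batch-size 100}])

(defn insert-task
  "Hypothetical helper: splice the task described by `entry` between the
  tasks `from` and `to`, returning an updated workflow and catalog."
  [{:keys [workflow catalog]} from to entry]
  (let [task (:onyx/name entry)]
    {:workflow (vec (mapcat (fn [edge]
                              (if (= edge [from to])
                                [[from task] [task to]]
                                [edge]))
                            workflow))
     :catalog  (conj catalog entry)}))

;; Because the job is just data, it can be rewritten before submission.
(def job
  (insert-task {:workflow workflow :catalog catalog}
               :parse-line :write-results
               {:onyx/name :filter-errors :onyx/type :function
                :onyx/batch-size 100 :onyx/fn :my.app/error-free?}))

(:workflow job)
;; => [[:read-lines :parse-line]
;;     [:parse-line :filter-errors]
;;     [:filter-errors :write-results]]
```

Nothing here calls into Onyx itself; the point is only that ordinary functions such as `mapcat` and `conj` are enough to reshape a job when it is described as maps and vectors.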

jeletonskelly 11y ago

I'm interested to know if you've used Storm at all and how it compares to Onyx. I'm currently considering both for a project.

XPherior 11y ago

Hi folks! I'm Michael Drogalis - the primary author. I'm happy to answer any questions.

bmh100 11y ago

What were the main pain points that motivated you to develop Onyx? What capabilities do you want to add or have already added that Storm doesn't provide?

XPherior 11y ago

See: https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-...

These are all the things I wrote down that I wanted before I wrote the first line of code.

johnmurray_io 11y ago

Check out the original video introducing Onyx: http://youtu.be/vG47Gui3hYE

maelito 11y ago

Live open sourcing!

lkrubner 11y ago

If this interests you, then you should also check out the post where Michael Drogalis first introduced this:

http://michaeldrogalis.tumblr.com/post/98143185776/onyx-dist...

dj-wonk 11y ago

Re: Onyx's architecture. I wonder about performance when keeping a shared log in ZooKeeper. Why not use something like Kafka? It is designed for high-volume, immutable logging. ZK works best for less frequently changing configuration, such as node connection information or snapshotting. I could be wrong. I'd like to hear your thoughts and experience.

XPherior 11y ago

- Picking up Kafka means introducing another dependency.

- Onyx's log doesn't grow particularly large because it's only used for coordination, not for messaging.

- Because the log isn't huge, and can be GC'ed, consumers don't experience high volumes of messages.

- ZooKeeper offers sequential node creation - making it a really good fit for what the log needs to do.
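The points above can be sketched with a small in-memory model. This is not a ZooKeeper client and not Onyx's implementation; it only imitates two properties the list relies on: sequential creation (ZooKeeper appends a zero-padded, monotonically increasing counter to the node name) and garbage collection of old entries. The function names and the `/onyx/log/entry-` prefix are hypothetical.

```clojure
;; In-memory stand-in for a coordination log built on sequential nodes.
;; An atom plays the role of the ZooKeeper ensemble (assumption for the sketch).
(defn new-log []
  (atom {:next-seq 0 :entries (sorted-map) :last-path nil}))

(defn append-entry!
  "Add `data` under `prefix` plus a 10-digit sequence number and return
  the path that was created. Writers never choose their own position."
  [log prefix data]
  (:last-path
   (swap! log
          (fn [{:keys [next-seq] :as state}]
            (let [path (format "%s%010d" prefix next-seq)]
              (-> state
                  (update :next-seq inc)
                  (assoc-in [:entries path] data)
                  (assoc :last-path path)))))))

(defn gc-before!
  "Drop every entry whose path sorts before `path`. The counter is not
  reset, so later entries keep increasing sequence numbers."
  [log path]
  (swap! log update :entries
         (fn [entries] (into (sorted-map) (subseq entries >= path))))
  nil)

(defn read-entries
  "Return the remaining entries in sequence order."
  [log]
  (vals (:entries @log)))

(def log (new-log))
(append-entry! log "/onyx/log/entry-" {:event :peer-joined})
;; => "/onyx/log/entry-0000000000"
(append-entry! log "/onyx/log/entry-" {:event :job-submitted})
;; => "/onyx/log/entry-0000000001"
```

Every reader that replays `read-entries` sees the same total order, which is the property a coordination log needs; message payloads never pass through it in this model.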

boothead 11y ago

Looks superficially similar to https://github.com/aphyr/tesser. Does anyone know both and can give a comparison?

From a brief examination, Tesser looks a lot simpler (probably because most of the folding is encoded using various monoids). Does Onyx have a similar abstraction model that I missed?
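For readers unfamiliar with the monoid idea mentioned above, here is a minimal sketch using `clojure.core.reducers/fold` from Clojure's standard library. It deliberately does not use Tesser's API; `count-combine` and `count-reduce` are hypothetical names. A monoid here is an identity value plus an associative combine step, which is what lets partial results from separate chunks be merged in any grouping.

```clojure
(require '[clojure.core.reducers :as r])

(defn count-combine
  "Monoid over frequency maps: the zero-arity call returns the identity
  (an empty map), the two-arity call merges two partial results."
  ([] {})
  ([a b] (merge-with + a b)))

(defn count-reduce
  "Fold one word into a partial frequency map."
  [acc word]
  (update acc word (fnil inc 0)))

(r/fold count-combine count-reduce ["a" "b" "a"])
;; => {"a" 2, "b" 1}
```

On large vectors `r/fold` splits the input, reduces the pieces in parallel, and merges them with the combine function; on a small input like this one it reduces sequentially, with the same result.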

erichmond 11y ago

Onyx is distributed and Tesser just uses all the available cores of a particular machine AFAIK.

Both libraries are awesome.

boothead 11y ago

Tesser also allows you to distribute it using Hadoop, I think. I haven't used it; I only happened to hear about it when @aphyr gave a talk at the Clojure eXchange in London.
