Your criticism is totally fair. People have been curious about Storm so we wanted to provide a little bit of information about it. We'll have demos soon, and of course it will be open sourced within a few months.
If you're curious about our credibility, I think our other open source projects speak to the quality of software we produce:
https://github.com/nathanmarz/cascalog https://github.com/nathanmarz/elephantdb
Now, I agree that it's kind of a bummer we can't play with it right now, but the fact that you guys made this and are going to open source it is already awesome in itself.
Response - http://twitter.com/#!/kevinweil/status/73263430873792512
If you can't make it open source, at least write a serious paper to support the claims, like Google did for Bigtable.
A lot of people think their systems are scalable and fault-tolerant. Most are not. And from the information provided, we can't tell.
We're a startup — we're not going to write an academic paper supporting the claims in the post. Nevertheless, Storm's an exciting project many people are curious to learn more about; that's why we've written something about it now.
We have a demo coming soon, and Storm itself will be open sourced soon enough.
It's a common misconception. A real-time system doesn't have to be fast, efficient, or fault tolerant. A real-time system must guarantee with 100% certainty that in all cases it will respond to input X within a time period Y.
I would be interested to learn the timing issues driving the development of this system and how you've guaranteed such a response time, especially given that it's running on top of the JVM and must therefore deal with a non-deterministic garbage collection process.
Every time I see a post describing a "real time" system, I read it hoping that what they're describing is a hard real-time system, because those are neat; but they never are, probably because they're so difficult and expensive to build. Also, I guess they aren't the most relevant type of system for the majority of people here, who are dealing (as you say) with customer-facing front ends.
(1) What do you mean by a processing topology -- is this a data dependency graph?
(2) How does one define a topology? Is this specified at deployment time via the jar file, or can it be configured separately and on the fly?
(3) Must records be processed in time order, or can they be sorted and aggregated on some other key?
2. To deploy a topology, you give the master machine a jar containing all the code and a topology. The topology can be created dynamically at submit time, but once it's submitted it's static.
3. You receive the records in the order the spouts emit them. Things like sorting, windowed joins, and aggregations are built on top of the primitives that Storm provides. There's going to be a lot of room for higher-level abstractions on top of Storm, à la Cascalog/Pig/Cascading/Hive on top of Hadoop. We've already started designing a Clojure-based DSL for Storm.
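To make the "built on top of the primitives" point concrete, here's a toy sketch (in Python, purely illustrative, not Storm's actual API) of a windowed aggregation layered on ordered tuple delivery: the only primitive assumed is that a bolt's execute method sees tuples in emit order.

```python
from collections import defaultdict, deque

class RollingCountBolt:
    """Toy windowed aggregation built on ordered tuple delivery.

    Hypothetical class, not Storm's real API; it just shows how a
    higher-level operation falls out of the ordering primitive.
    """
    def __init__(self, window_size):
        self.window = deque()            # tuples in arrival order
        self.counts = defaultdict(int)   # rolling count per key
        self.window_size = window_size

    def execute(self, key):
        # Evict the oldest tuple once the window is full.
        if len(self.window) == self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
        self.window.append(key)
        self.counts[key] += 1
        return self.counts[key]

bolt = RollingCountBolt(window_size=3)
results = [bolt.execute(k) for k in ["a", "a", "b", "a", "b"]]
print(results)  # [1, 2, 1, 2, 2]
```

A DSL like the Clojure one mentioned above would presumably generate this kind of bookkeeping for you.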
For a variety of reasons, I keep my browser windows about 900 pixels wide. Your site requires a honking 1280 to get rid of the horizontal scrollbar -- and can't be read in 900 without scrolling horizontally for every line (i.e. the menu on the left is much too wide).
(OT, I know, but it's a pet peeve of mine. It's been known for years how to use CSS to make pages stretch or squish, within reason, to the user's window width. 900 is not too narrow!)
EDITED to add: yeah, I'm willing to spend some karma points on this, if that's what happens. Wide sites are getting more common, and this is one of the worst I've seen.
I find your width comments especially relevant right now because we have just started a new site design project focused on offering 5 different width-based layouts. Your comment is proof that offering multiple content-width options to desktop users, not just mobile users, is useful to people besides me.
(I mean on the actual processing front, rather than architecturally -- sounds like Storm is a bunch of building blocks instead of a unified system.)
storm jar mycode.jar backtype.storm.my_topology
In this example, the "backtype.storm.my_topology" defines the realtime computation as a processing graph. The "storm jar" command causes the code to be distributed across the cluster and executes it using many processes across the cluster. Storm makes sure the topology runs forever (or at least until you stop the topology).
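For intuition about what "a processing graph" means here, this is a tiny simulation (hypothetical names, nothing to do with Storm's real API): spouts emit tuples, bolts transform them, and the topology's edges route each node's output downstream.

```python
# Toy processing graph: node -> list of downstream nodes.
topology = {
    "sentence_spout": ["split_bolt"],
    "split_bolt":     ["count_bolt"],
    "count_bolt":     [],
}

# Each node is a function from one input tuple to a list of emitted tuples.
nodes = {
    "sentence_spout": lambda _: ["the quick brown fox"],
    "split_bolt":     lambda sentence: sentence.split(),
    "count_bolt":     lambda word: [(word, 1)],
}

def run(node, tup, out):
    """Push a tuple through a node, routing emissions along the edges."""
    for emitted in nodes[node](tup):
        downstream = topology[node]
        if not downstream:          # terminal node: collect the result
            out.append(emitted)
        for d in downstream:
            run(d, emitted, out)

out = []
run("sentence_spout", None, out)
print(out)  # [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1)]
```

In the real system, each node would be many processes spread across the cluster rather than a local function call, but the graph structure is the same idea.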
(I can't say I'm intimately familiar with every CEP system out there, so feel free to correct me if there are distributed CEP systems. Those products tend to have webpages which make it hard to decipher what they actually are / do)
Or you could use a Graph DB to solve a Graph problem.
URL -> tweeted_by -> users -> followed_by -> users
Try that on Neo4j.
The reach computation on Storm does everything in parallel (across however many machines you need to scale the computation) and gets data using distributed key/value databases (Riak, in our case).
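A rough sketch of that shape of computation (in-memory dicts standing in for the distributed key/value store, Riak in the real deployment; all names made up): the follower lookups fan out in parallel, then the results are distinct-counted.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the key/value databases (hypothetical data).
tweeted_by = {"http://example.com": ["alice", "bob"]}
followers  = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

def reach(url):
    """URL -> tweeters -> followers -> distinct follower count."""
    tweeters = tweeted_by.get(url, [])
    # Fan the follower lookups out in parallel, the way Storm would
    # fan them out across many tasks on many machines.
    with ThreadPoolExecutor() as pool:
        follower_sets = pool.map(lambda u: followers.get(u, []), tweeters)
    # Distinct-count the union (dave follows both tweeters,
    # so he's only counted once).
    return len(set().union(*map(set, follower_sets)))

print(reach("http://example.com"))  # carol, dave, erin -> 3
```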
This is the traditional realtime processing use case: process messages and update a variety of databases.
Question: I typically think of real-time as a need for user-facing things, i.e. handling a user's requests before he gets bored and goes away. Is Storm set up for that? Or is it mostly meant to update a database with results rather than return them to a waiting process?
In "realtime" analysis, you tell the system "these are the queries I want you to run" and it continuously updates those with answers as data arrives.
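The distinction in miniature (a hypothetical standing query, not anything Storm-specific): the answer is updated incrementally as each record arrives, rather than recomputed from scratch when someone asks.

```python
from collections import defaultdict

# Standing query: "count tweets per hashtag, continuously."
counts = defaultdict(int)

def on_tweet(hashtags):
    # Each arriving record updates the query's answer in place;
    # reading the answer later is just a lookup, not a recomputation.
    for tag in hashtags:
        counts[tag] += 1

for tweet in (["storm"], ["storm", "hadoop"], ["hadoop"]):
    on_tweet(tweet)

print(dict(counts))  # {'storm': 2, 'hadoop': 2}
```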
Also, this sounds faintly like the old Sun Grid Engine.
The projects share similarities. The biggest difference with S4 is that Storm guarantees that messages will be processed, whereas S4 will just drop the messages. Getting this level of reliability is more than just using TCP to send messages - you need to track the processing of messages in an efficient way and retry messages if the message doesn't get completed for some reason (like a node goes down). Implementing reliability is non-trivial and affects the design of the whole system.
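A minimal sketch of the "track and retry" idea (an assumed design for illustration, not Storm's actual internals, which are presumably far more efficient): a spout keeps each emitted message in a pending set and replays any message that isn't acked before a timeout.

```python
import time

class ReliableSpout:
    """At-least-once delivery sketch: track in-flight messages and
    replay any that are not acked within the timeout."""

    def __init__(self, messages, timeout=1.0):
        self.queue = list(messages)
        self.pending = {}      # msg_id -> (message, last emit time)
        self.timeout = timeout
        self.next_id = 0

    def emit(self):
        # Replay timed-out messages before emitting new ones.
        now = time.monotonic()
        for msg_id, (msg, t) in self.pending.items():
            if now - t > self.timeout:
                self.pending[msg_id] = (msg, now)
                return msg_id, msg
        if self.queue:
            msg = self.queue.pop(0)
            msg_id = self.next_id
            self.next_id += 1
            self.pending[msg_id] = (msg, now)
            return msg_id, msg
        return None

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # done: stop tracking it

spout = ReliableSpout(["a"], timeout=0.05)
i1, m1 = spout.emit()    # emits "a"
time.sleep(0.1)          # no ack arrives in time...
i2, m2 = spout.emit()    # ...so "a" is replayed
spout.ack(i2)
print(m1, m2, spout.pending)  # a a {}
```

Even this toy version shows why reliability shapes the whole design: every component downstream has to participate in acking for the spout to know when a message is truly done.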
We also felt that there was a lot of accidental complexity in S4's API, but that's a secondary issue.
One of the main problems they solve is "distributed RPC", from TFA: "There are a lot of queries that are both hard to precompute and too intense to compute on the fly on a single machine."
That's generally a sign that you've made a mistake somewhere in your application design. Pain is a response that tells you "stop doing that".