Your criticism is totally fair. People have been curious about Storm so we wanted to provide a little bit of information about it. We'll have demos soon, and of course it will be open sourced within a few months.
If you're curious about our credibility, I think our other open source projects speak to the quality of software we produce:
https://github.com/nathanmarz/cascalog https://github.com/nathanmarz/elephantdb
Now, I agree that it's kind of a bummer we can't play with it right now, but the fact that you guys made this and are going to open source it is already awesome in itself.
Response - http://twitter.com/#!/kevinweil/status/73263430873792512
If you can't make it open source, at least write a serious paper to support the claims, like Google did for Bigtable.
A lot of people think their systems are scalable and fault-tolerant. Most are not. And from the information provided, we can't tell.
We're a startup — we're not going to write an academic paper supporting the claims in the post. Nevertheless, Storm's an exciting project many people are curious to learn more about; that's why we've written something about it now.
We have a demo coming soon, and Storm itself will be open sourced soon enough.
It's a common misconception. A real-time system doesn't have to be fast, efficient, or fault tolerant. A real-time system must guarantee with 100% certainty that in all cases it will respond to input X within a time period Y.
I would be interested to learn the timing issues driving the development of this system and how you've guaranteed such a response time, especially given that it's running on top of the JVM and must therefore deal with a non-deterministic garbage collection process.
Every time I see a post describing a "real time" system, I read it hoping that what they're describing is a hard real-time system, because those are neat; but they never are, probably because they're so difficult and expensive to build. Also, I guess they aren't the most relevant type of system for the majority of people here, who are dealing (as you say) with customer-facing front ends.
(1) What do you mean by a processing topology -- is this a data dependency graph?
(2) How does one define a topology? Is this specified at deployment time via the jar file, or can it be configured separately and on the fly?
(3) Must records be processed in time order, or can they be sorted and aggregated on some other key?
2. To deploy a topology, you give the master machine a jar containing all the code and a topology. The topology can be created dynamically at submit time, but once it's submitted it's static.
3. You receive the records in the order the spouts emit them. Things like sorting, windowed joins, and aggregations are built on top of the primitives that Storm provides. There's going to be a lot of room for higher-level abstractions on top of Storm, à la Cascalog/Pig/Cascading/Hive on top of Hadoop. We've already started designing a Clojure-based DSL for Storm.
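To make the "built on top of the primitives" point concrete, here's a toy sketch (in Python, purely illustrative, not Storm's actual API) of a windowed aggregation layered on ordered tuple delivery: the only primitive assumed is that a bolt's execute method sees tuples in emit order.

```python
from collections import defaultdict, deque

class RollingCountBolt:
    """Toy windowed aggregation built on ordered tuple delivery.

    Hypothetical class, not Storm's real API; it just shows how a
    higher-level operation falls out of the ordering primitive.
    """
    def __init__(self, window_size):
        self.window = deque()            # tuples in arrival order
        self.counts = defaultdict(int)   # rolling count per key
        self.window_size = window_size

    def execute(self, key):
        # Evict the oldest tuple once the window is full.
        if len(self.window) == self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
        self.window.append(key)
        self.counts[key] += 1
        return self.counts[key]

bolt = RollingCountBolt(window_size=3)
results = [bolt.execute(k) for k in ["a", "a", "b", "a", "b"]]
print(results)  # [1, 2, 1, 2, 2]
```

A DSL like the Clojure one mentioned above would presumably generate this kind of bookkeeping for you.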
For a variety of reasons, I keep my browser windows about 900 pixels wide. Your site requires a honking 1280 to get rid of the horizontal scrollbar -- and can't be read in 900 without scrolling horizontally for every line (i.e. the menu on the left is much too wide).
(OT, I know, but it's a pet peeve of mine. It's been known for years how to use CSS to make pages stretch or squish, within reason, to the user's window width. 900 is not too narrow!)
EDITED to add: yeah, I'm willing to spend some karma points on this, if that's what happens. Wide sites are getting more common, and this is one of the worst I've seen.
I find your width comments especially relevant right now because we have just started a new site design project focused on offering 5 different width-based layouts. Your comment is proof that offering multiple content-width options to desktop users, not just mobile users, is useful to people besides me.
(I mean on the actual processing front, rather than architecturally -- sounds like Storm is a bunch of building blocks instead of a unified system.)
storm jar mycode.jar backtype.storm.my_topology
In this example, the "backtype.storm.my_topology" defines the realtime computation as a processing graph. The "storm jar" command causes the code to be distributed across the cluster and executes it using many processes across the cluster. Storm makes sure the topology runs forever (or at least until you stop the topology).
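For intuition about what "a processing graph" means here, this is a tiny simulation (hypothetical names, nothing to do with Storm's real API): spouts emit tuples, bolts transform them, and the topology's edges route each node's output downstream.

```python
# Toy processing graph: node -> list of downstream nodes.
topology = {
    "sentence_spout": ["split_bolt"],
    "split_bolt":     ["count_bolt"],
    "count_bolt":     [],
}

# Each node is a function from one input tuple to a list of emitted tuples.
nodes = {
    "sentence_spout": lambda _: ["the quick brown fox"],
    "split_bolt":     lambda sentence: sentence.split(),
    "count_bolt":     lambda word: [(word, 1)],
}

def run(node, tup, out):
    """Push a tuple through a node, routing emissions along the edges."""
    for emitted in nodes[node](tup):
        downstream = topology[node]
        if not downstream:          # terminal node: collect the result
            out.append(emitted)
        for d in downstream:
            run(d, emitted, out)

out = []
run("sentence_spout", None, out)
print(out)  # [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1)]
```

In the real system, each node would be many processes spread across the cluster rather than a local function call, but the graph structure is the same idea.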
(I can't say I'm intimately familiar with every CEP system out there, so feel free to correct me if there are distributed CEP systems. Those products tend to have webpages which make it hard to decipher what they actually are / do)
Or you could use a Graph DB to solve a Graph problem.
URL -> tweeted_by -> users -> followed_by -> users
Try that on Neo4j.
The reach computation on Storm does everything in parallel (across however many machines you need to scale the computation) and gets data using distributed key/value databases (Riak, in our case).
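A rough sketch of that shape of computation (in-memory dicts standing in for the distributed key/value store, Riak in the real deployment; all names made up): the follower lookups fan out in parallel, then the results are distinct-counted.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the key/value databases (hypothetical data).
tweeted_by = {"http://example.com": ["alice", "bob"]}
followers  = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

def reach(url):
    """URL -> tweeters -> followers -> distinct follower count."""
    tweeters = tweeted_by.get(url, [])
    # Fan the follower lookups out in parallel, the way Storm would
    # fan them out across many tasks on many machines.
    with ThreadPoolExecutor() as pool:
        follower_sets = pool.map(lambda u: followers.get(u, []), tweeters)
    # Distinct-count the union (dave follows both tweeters,
    # so he's only counted once).
    return len(set().union(*map(set, follower_sets)))

print(reach("http://example.com"))  # carol, dave, erin -> 3
```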
This is the traditional realtime processing use case: process messages and update a variety of databases.
Question: I typically think of real-time as a need for user-facing things, i.e. handling a user's requests before he gets bored and goes away. Is Storm set up for that? Or is it mostly meant to update a database with results rather than return them to a waiting process?
In "realtime" analysis, you tell the system "these are the queries I want you to run" and it continuously updates those with answers as data arrives.
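The distinction in miniature (a hypothetical standing query, not anything Storm-specific): the answer is updated incrementally as each record arrives, rather than recomputed from scratch when someone asks.

```python
from collections import defaultdict

# Standing query: "count tweets per hashtag, continuously."
counts = defaultdict(int)

def on_tweet(hashtags):
    # Each arriving record updates the query's answer in place;
    # reading the answer later is just a lookup, not a recomputation.
    for tag in hashtags:
        counts[tag] += 1

for tweet in (["storm"], ["storm", "hadoop"], ["hadoop"]):
    on_tweet(tweet)

print(dict(counts))  # {'storm': 2, 'hadoop': 2}
```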
Also, this sounds faintly like the old Sun Grid Engine.
The projects share similarities. The biggest difference with S4 is that Storm guarantees that messages will be processed, whereas S4 will just drop the messages. Getting this level of reliability is more than just using TCP to send messages - you need to track the processing of messages in an efficient way and retry messages if the message doesn't get completed for some reason (like a node goes down). Implementing reliability is non-trivial and affects the design of the whole system.
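A minimal sketch of the "track and retry" idea (an assumed design for illustration, not Storm's actual internals, which are presumably far more efficient): a spout keeps each emitted message in a pending set and replays any message that isn't acked before a timeout.

```python
import time

class ReliableSpout:
    """At-least-once delivery sketch: track in-flight messages and
    replay any that are not acked within the timeout."""

    def __init__(self, messages, timeout=1.0):
        self.queue = list(messages)
        self.pending = {}      # msg_id -> (message, last emit time)
        self.timeout = timeout
        self.next_id = 0

    def emit(self):
        # Replay timed-out messages before emitting new ones.
        now = time.monotonic()
        for msg_id, (msg, t) in self.pending.items():
            if now - t > self.timeout:
                self.pending[msg_id] = (msg, now)
                return msg_id, msg
        if self.queue:
            msg = self.queue.pop(0)
            msg_id = self.next_id
            self.next_id += 1
            self.pending[msg_id] = (msg, now)
            return msg_id, msg
        return None

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # done: stop tracking it

spout = ReliableSpout(["a"], timeout=0.05)
i1, m1 = spout.emit()    # emits "a"
time.sleep(0.1)          # no ack arrives in time...
i2, m2 = spout.emit()    # ...so "a" is replayed
spout.ack(i2)
print(m1, m2, spout.pending)  # a a {}
```

Even this toy version shows why reliability shapes the whole design: every component downstream has to participate in acking for the spout to know when a message is truly done.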
We also felt that there was a lot of accidental complexity in S4's API, but that's a secondary issue.
One of the main problems they solve is "distributed RPC", from TFA: "There are a lot of queries that are both hard to precompute and too intense to compute on the fly on a single machine."
That's generally a sign that you've made a mistake somewhere in your application design. Pain is a response that tells you "stop doing that".