Yahoo Open Sources S4 Real-Time MapReduce Framework (opens in new tab)

(github.com)

42 pointsatarashi15y ago15 comments

15 comments

the entire point was that it's not mapreduce, it's a stream processing system. try http://wiki.s4.io/Manual/S4Overview#toc1 for a "what is S4" intro.

The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.

and

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

go stream processing. also see Flume for related recent works.

DanielRibeiro15y ago

Seems very much like Percolator from google: http://www.infoq.com/news/2010/10/google-percolator

schumihan15y ago

I don't think so. Percolator does not support stream operators, and it supports transaction semantic. Streaming process engine like s4 will never handle the race conditions brought by multiple writers.

atarashiOP15y ago

Indeed, it's not really MapReduce (as in Hadoop or Google's MapReduce), but Yahoo Labs did call it "Real-Time MapReduce."

http://labs.yahoo.com/event/99

Groxx15y ago

That's not a readme! At best that's a weak INSTALL.

Anyone care to illuminate to me how this works? "Real-time MapReduce" sounds almost like an oxymoron.

The project's site ( http://s4.io/ ) isn't very helpful either. Full source:

  the S4 open source project. coming soon.

varworld15y ago

http://wiki.s4.io/

swah15y ago

It's kind of funny how almost no one is advised to write code in Java these days, and on the other hand really cool infrastructure projects like this or Cassandra are almost always written in Java.

I understand they have to perform better than the average webapp, but still...funny.

beagle315y ago

My recent experience with Java code is that efficiency conscious Java code is ridiculously under-performing compared to efficiency conscious C code.

By virtue of Java's structure and culture (as codified in its libraries), you have to spend approximately twice as much physical memory as you would in C. E.g., I had a 100,000,000 record array that I needed to sort. That's great -- Java has an Array.Utils.sort() method; But I needed to know the _sorting permutation_ as well. That requires an array of Integers, each taking >=16 bytes (so, 1.6GB of ram). Which slows everything to a crawl. So I wrote my own heapsort routine which needed only 400MB of ram. But it's 10 times slower than the equivalent C -- I suspect because of array bounds that the compiler can't throw away.

Those infrastructure projects are "cool", but horribly done. I don't know about Cassandra, but Hadoop makes jobs take between twice and ten times the resources that they need. Sure, it scales to 4000 machines, but you'd only need 1000 if it were properly written. And if you only have a 10-node cluster, you could probably run it on one machine with comparable throughput.

Color me unimpressed with Java.

swah15y ago

I was actually thinking of Ruby/Python folks, perhaps there is a group that will skip Java/C/C++ entirely.

Against C there is always the "Java makes it more difficult for you to mess things up" argument...

1 more reply

schumihan15y ago

You can work with much higher productivity in java...

1 more reply

evgen15y ago

I think it is just you noticing them as Java, not that more are in Java. For every Cassandra there is a Riak, etc.

anandkesari15y ago

@atarashi: S4 is a stream processing platform that has been developed at Yahoo!. The website and github repository are under development. Follow us on twitter http://twitter.com/s4project

yarapavan15y ago

Website is live: http://s4.io/

j / k navigate · click thread line to collapse

15 comments

rektide15y ago

the entire point was that it's not mapreduce, it's a stream processing system. try http://wiki.s4.io/Manual/S4Overview#toc1 for a "what is S4" intro.

and

go stream processing. also see Flume for related recent works.

DanielRibeiro15y ago

Seems very much like Percolator from google: http://www.infoq.com/news/2010/10/google-percolator

schumihan15y ago

atarashiOP15y ago

Indeed, it's not really MapReduce (as in Hadoop or Google's MapReduce), but Yahoo Labs did call it "Real-Time MapReduce."

http://labs.yahoo.com/event/99

Groxx15y ago

That's not a readme! At best that's a weak INSTALL.

Anyone care to illuminate to me how this works? "Real-time MapReduce" sounds almost like an oxymoron.

The project's site ( http://s4.io/ ) isn't very helpful either. Full source:

  the S4 open source project. coming soon.

varworld15y ago

http://wiki.s4.io/

swah15y ago

It's kind of funny how almost no one is advised to write code in Java these days, and on the other hand really cool infrastructure projects like this or Cassandra are almost always written in Java.

I understand they have to perform better than the average webapp, but still...funny.

beagle315y ago

My recent experience with Java code is that efficiency conscious Java code is ridiculously under-performing compared to efficiency conscious C code.

Color me unimpressed with Java.

swah15y ago

I was actually thinking of Ruby/Python folks, perhaps there is a group that will skip Java/C/C++ entirely.

Against C there is always the "Java makes it more difficult for you to mess things up" argument...

1 more reply

schumihan15y ago

You can work with much higher productivity in java...

1 more reply

evgen15y ago

I think it is just you noticing them as Java, not that more are in Java. For every Cassandra there is a Riak, etc.

anandkesari15y ago

@atarashi: S4 is a stream processing platform that has been developed at Yahoo!. The website and github repository are under development. Follow us on twitter http://twitter.com/s4project

yarapavan15y ago

Website is live: http://s4.io/

j / k navigate · click thread line to collapse