result = Dir.glob('test?.txt').collect do |file_name|
  File.read(file_name).split(' ').collect do |word|
    word.downcase.tr('.,\'', '')
  end.inject(Hash.new(0)) do |hash, word|
    hash[word] += 1
    hash
  end
end.inject do |all, hash|
  (all.keys + hash.keys).uniq.inject(Hash.new(0)) do |acc, word|
    acc[word] = all[word] + hash[word]
    acc
  end
end
p result
Edit: The Python example in the article is better because it merges hashes in the reduce step, which facilitates parallelisation.

use strict;
use warnings;
use List::Util qw(reduce);
use File::Slurp qw(read_file);
sub word_count {
    reduce { $a->{$b}++; $a } {},
    map { split(/\W+/) }
    map { lc read_file $_ } @_;
}

If you look at the types implicit in his code, you should see that the essence lies not with collect or inject, which are incidental plumbing, but with the nature of the hash and, in particular, the act of generating the intermediate collection for each key. What is called map is actually more of a reverse reduce. The pedagogical emphasis on map and foldr actually belies the true nature and power of the algorithm when it comes to parallelizing.
result = Hash.new(0)
Dir['test?.txt'].each do |file|
  File.read(file).split(' ').each do |word|
    result[word.downcase.tr('.,\'', '')] += 1
  end
end
If you really needed map/reduce here you'd probably want to write it this way:

Dir['test?.txt'].map do |file|
  Hash.new(0).tap do |result|
    File.read(file).split(' ').each do |word|
      result[word.downcase.tr('.,\'', '')] += 1
    end
  end
end.reduce do |result, partial|
  result.merge(partial) { |k, v1, v2| v1 + v2 }
end

http://engineeringblog.yelp.com/2010/10/mrjob-distributed-co...
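Since the merge in that reduce is order-independent, the map phase really can run concurrently. A rough sketch using plain Ruby threads (one thread per file; assumes the same test?.txt naming as the snippets above, and a hypothetical count_words helper):

```ruby
# Hypothetical helper: count words in a single string, using the same
# normalization (downcase, strip periods/commas/apostrophes) as above.
def count_words(text)
  text.split(' ').each_with_object(Hash.new(0)) do |word, counts|
    counts[word.downcase.tr('.,\'', '')] += 1
  end
end

# Map phase: one thread per file, each producing a partial count.
partials = Dir['test?.txt'].map do |file|
  Thread.new { count_words(File.read(file)) }
end.map(&:value)

# Reduce phase: merge partials; the summing merge is associative,
# so the partials could equally be combined pairwise or in a tree.
result = partials.reduce(Hash.new(0)) do |all, partial|
  all.merge(partial) { |_word, v1, v2| v1 + v2 }
end
```

Ruby threads won't buy CPU parallelism under the GVL, but the structure is the same one mrjob distributes across machines.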