Python MapReduce

From Aivwiki

Revision as of 00:59, 19 December 2007 by Mzambrano (Talk | contribs)
(diff) ←Older revision | Current revision (diff) | Newer revision→ (diff)
Jump to: navigation, search

This is a minimal mapReduce python script that takes input in the form of key, value and returns a unique output. I think I should have called it a worker script.

Intermediate results:

  • A function output which is the result of applying a function to the value.
  • A group, which groups the output of these function on dictionary

The function output is applied to the values. The group is generated throught a dictionary and the final output is generate iteratively over the generated group.

An input file look like:

1 5227
2 45703
3 16591
4 32658
5 34855
6 46832
7 20137
8 24230
9 42876
10 1930
11 7289
12 19143
13 38628
14 25987
KEY VALUE

In this case we apply the function just to number values but it should be straightforward to apply functions to words or to webpages for example.

A typical run will look like:

./mapReduceOK.py A.txt
now comes the first dictionary intermediary:
{'44884': [1], '13356': [1], '30922': [1, 1], '11542': [1, 1, 1, 1, 1, 1, 1], '38842': [1], '11540': [1, 1], '11541': [1, 1, 1], '11546': [1], '11547': [1, 1, 1],'45445': [1, 1, 1, 1, 1],...}
and then the final output, in this case a list:
[...,('24332', 9), ('27263', 9), ('34604', 9), ('49975', 10), ('19555', 10)]

mapReduceOK.py returns the cumulative values of a given number. A.txt has 100.000 key, values pairs and this is processed in 16 secs on my Intel(R) Core(TM)2 CPU T7200@2.00GHz. Half of that time is taken displaying intermediary values.

In this case:

  • The map does: for each value it adds a 1 on the first dictionary. Since it is a dictionary it also group the results.
  • The reduce counts the number of one for each dictionary key(value).

More examples on the code.

Some interesting stuff about this code:

  • It doesn't use that much memory, it just iterate over the file.
  • When it runs, it's heavily CPU intensive.

More interesting stuff about these functions:

  • When you read from the input, the intermediary will be lesser or equal of the size of the input. If you index by result, this intermediary will be probably a lot smaller than the input, depending of the density of the function applied. In both cases the next step will be a lot faster than the first one.
  • You have a good oportunity to paralelize outputs the processing of the input and merge interediary or merge results.

Image:mapreduce.jpg

  • You could also map the input on different servers, serialize and group on one server. Reduce on a bunch of server then merge the results in one.
  • Indexing by results is useful to get frequencies of values.

What is missing:

  • I don't parse the arguments.
  • I don't handle the file in any special way. It's just a file on my file system.
  • Distribution: I just run one instance of the mapReduce program. I do have a multithreaded version but performance is not good.

Files:

Links:

I also wrote a version in java, some problems:

  • No dictionary, I had to implement this functionality with a double linked list and a hashtable to see if values exist.
  • It was very slow, the main issue was that while reading the input it does some kind of UTF decoding/encoding. I'm still look to fix that.

Isn't python great for providing out of the box dictionary?


Mauricio Zambrano 2007