python - Cross Product in Map Reduce using Hadoop Streaming and Python

Question

I am learning Python and Hadoop. I completed the setup and basic examples provided in official site using pythong+hadoop streaming. I considered implementing join of 2 files. I completed equi-join which checks if same key appears in both input files, then it outputs the key along with values from file 1 and file 2 in that order. The equality join is working as it is supposed.

Now, I wish to do inequality join which involves finding Cross Product before applying the inequality condition. I am using the same mapper (do I need to change it) and I changed the reducer so that it contains a nested loop (since every key-value pair in file1 must be matched with all key-values pairs in file2). This doesn't work since you can only go through the stream once. Now, I thought of an option of storing 'some' values in reducer and comparing them but I have no idea 'how' many. Naive method is to store whole file2 content in a array (or similar structure) but thats stupid and goes against the idea of distributed processing. Finally, my questions are

How can I store values in reducer so that I can have cross product between two files?
In equi-join, Hadoop seems to be sending all key value pairs with same key to same reducer which is perfectly fine and works well for that case. However, how I do change this behaviour (if needed) so that required grouping of key-value pairs go correct reducer?

Sample Files: http://pastebin.com/ufYydiPu

Python Map/Reduce Scripts: http://pastebin.com/kEJwd2u1

Hadoop Command I am using:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper mapper.py -file /home/hduser/ireducer.py -reducer reducer.py -input /user/hduser/inputfiles/* -output /user/hduser/join-output

Any help/hint is much appreciated.

score 3 · Accepted Answer

处理多个组合的一种方法可以非常有助于避免嵌套循环，这是使用 itertools 模块。特别是itertools.product函数，它使用生成器处理笛卡尔积。这有利于内存使用、效率，并且如果您必须在一个 map reduce 作业中加入多个数据集，它可以显着简化您的代码。

关于mapper产生的数据和reducer中要合并的数据集的对应关系，如果每个key的数据集不是太大，你可以简单地从mapper中产生如下组合：

{key, [origin_1, values]}
{key, [origin_2, values]}

因此，您将能够将 reducer 中具有相同来源的值分组到字典中，这些字典将是使用 itertools.product 应用笛卡尔积的数据集。

python - Cross Product in Map Reduce using Hadoop Streaming and Python

1 回答 1

Related

Reference