
I'm trying to run a Python program on Hadoop. The program uses the NLTK library as well as the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
        print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, then the program fails and says

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked and in a standalone python program,

print stopwords.words('english')

works perfectly fine, and so I am absolutely stumped as to why it's causing my Hadoop program to fail.

I would greatly appreciate any help! Thank you


2 Answers


Is 'english' a file in print stopwords.words('english')? If so, you need to pass it with -file as well so that it gets shipped to the nodes.
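A sketch of that approach, under two assumptions the question does not confirm: the 'english' stopword file is shipped with an extra -file option (so it lands in each task's working directory), and it contains one word per line. The mapper can then read it directly with the standard library, avoiding any nltk import on the cluster nodes:

```python
#!/usr/bin/env python
import os
import sys

def load_stopwords(path):
    """Load one stopword per line from a file shipped via -file."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def filter_stopwords(line, stopwords):
    """Drop stopwords from a whitespace-tokenized line."""
    return [w for w in line.split() if w.lower() not in stopwords]

# 'english' only exists in the working directory when the job shipped it;
# skip silently otherwise so the script is importable elsewhere.
if __name__ == '__main__' and os.path.exists('english'):
    stopwords = load_stopwords('english')
    for line in sys.stdin:
        print(' '.join(filter_stopwords(line, stopwords)))
```

The corresponding job submission would add -file /path/to/english (a hypothetical path) alongside the existing -file options for mapper.py and reducer.py.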

answered 2013-09-30T22:07:21.290

Use these commands to load the zipped libraries:

import zipimport

importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the link I pasted above. They mention all the steps.
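The zipimport pattern can be demonstrated self-contained with a toy module standing in for nltk.zip (building real nltk.zip/yaml.zip archives from the installed packages is a separate packaging step, and load_module is deprecated in newer Python versions in favor of the importlib machinery):

```python
import os
import tempfile
import zipfile
import zipimport

# Build a zip archive containing a tiny pure-Python module.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'mylib.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('mylib.py', 'def greet():\n    return "hello from a zip"\n')

# A zipped pure-Python module is importable without unpacking it first.
importer = zipimport.zipimporter(zip_path)
mylib = importer.load_module('mylib')
print(mylib.greet())  # -> hello from a zip
```

This only works for pure-Python code; any compiled extension modules inside the archive would still have to be unpacked or installed on the nodes.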

answered 2013-09-27T23:56:28.003