I'm trying to run a Python program on Hadoop. The program uses the NLTK library and the Hadoop Streaming API, as described here.
mapper.py:
#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords
#print stopwords.words('english')
for line in sys.stdin:
    print line,
reducer.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
Console command:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output
This runs perfectly, with the output simply containing the lines of the input file.
However, when this line in mapper.py:
#print stopwords.words('english')
is uncommented, the job fails with:
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I have checked that in a standalone Python program,
print stopwords.words('english')
works perfectly fine, so I am absolutely stumped as to why it causes my Hadoop job to fail.
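To narrow this down, one thing I tried is wrapping the stopwords lookup so that any exception is written to stderr rather than killing the map task silently; Hadoop Streaming captures each task's stderr in its task logs, so the real traceback (e.g. a missing nltk module or missing nltk_data corpus on the task nodes) should show up there. This is just a diagnostic sketch, not a fix — the helper name is my own:

```python
import sys
import traceback

def try_load_stopwords():
    """Return the English stopword list, or None on failure.

    Any exception (ImportError for nltk, LookupError for a missing
    stopwords corpus, etc.) is printed to stderr, which Hadoop
    Streaming records in the task's stderr log.
    """
    try:
        from nltk.corpus import stopwords
        return stopwords.words('english')
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return None
```

Calling try_load_stopwords() at the top of mapper.py would let the job keep running while exposing the underlying error in the task logs.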
I would greatly appreciate any help! Thank you