python - Files not being cached on AWS Elastic MapReduce

Question

I am running the following MapReduce job on AWS Elastic MapReduce:

./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE --mapper s3://classify.mysite.com/mapper.py --reducer s3://classify.mysite.com/reducer.py --input s3n://classify.mysite.com/s3_list.txt --output s3://classify.mysite.com/dat_output4/ --cache s3n://classify.mysite.com/classifier.py#classifier.py --cache-archive s3n://classify.mysite.com/policies.tar.gz#policies --bootstrap-action s3://classify.mysite.com/bootstrap.sh --enable-debugging --master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large

For some reason, the cache file classifier.py does not appear to be cached. I get this error when reducer.py tries to import it:

  File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
    from classifier import text_from_html, train_classifiers
ImportError: No module named classifier

classifier.py is definitely present at s3n://classify.mysite.com/classifier.py. For what it's worth, the policies archive seems to load just fine.

score 4 · Accepted Answer

I don't know how to solve this on EC2, but I have seen this problem before when using Python in traditional Hadoop deployments. Hopefully the lesson carries over.

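What we need to do is add the directory containing reducer.py to the Python path, because presumably classifier.py is there too. For whatever reason, that directory is not on the Python path, so the import cannot find classifier.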

import sys
import os.path

# add the directory where reducer.py is to the python path
sys.path.append(os.path.dirname(__file__))
# __file__ is the path to reducer.py, including the file name
# dirname strips the file name and leaves only the directory
# sys.path is the list of directories Python searches for modules

from classifier import text_from_html, train_classifiers

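The reason your code probably works locally is the current working directory you run it from. Hadoop may not run the script from the same working directory that you do.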
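To check whether the working directory really is the culprit, a small diagnostic like the one below could be dropped in near the top of reducer.py. This is only a sketch, assuming Python 3; describe_environment is a hypothetical helper name, not part of Hadoop or EMR. It writes to stderr, which Hadoop streaming keeps in the task attempt logs rather than treating as reducer output.

```python
import os
import sys


def describe_environment(script_path):
    """Return lines describing where the script runs and what is in the cwd."""
    # Directory that holds the script itself (hypothetical helper, for debugging only)
    script_dir = os.path.dirname(os.path.abspath(script_path))
    cwd = os.getcwd()
    return [
        "cwd: " + cwd,
        "script dir: " + script_dir,
        "cwd contents: " + ", ".join(sorted(os.listdir(cwd))),
        "sys.path: " + repr(sys.path),
    ]


if __name__ == "__main__":
    # stderr keeps the diagnostics out of the key/value stream on stdout
    for line in describe_environment(sys.argv[0]):
        sys.stderr.write(line + "\n")
```

If classifier.py shows up in the listing but neither directory appears in sys.path, the file was cached and only the import path is wrong.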

score 1 · Accepted Answer

Credit goes to orangeoctopus for his comment. I had to append the working directory to the system path:

sys.path.append('./')
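For anyone who wants to cover both cases at once, here is a minimal sketch that appends the script's directory and the current working directory, then checks whether the module can be located before importing it. It assumes Python 3 (importlib.util.find_spec is not available in Python 2), and add_search_dirs and module_is_importable are hypothetical helper names, not something from either answer.

```python
import importlib.util
import os
import sys


def add_search_dirs(script_path, path_list=None):
    """Append the script's directory and the cwd to a module search path list."""
    if path_list is None:
        path_list = sys.path
    candidates = [os.path.dirname(os.path.abspath(script_path)), os.getcwd()]
    for directory in candidates:
        # Skip directories that are already listed to avoid duplicates
        if directory not in path_list:
            path_list.append(directory)
    return path_list


def module_is_importable(name):
    """Report whether the import system can locate a top-level module."""
    return importlib.util.find_spec(name) is not None
```

In reducer.py this would be called as add_search_dirs(__file__) before the import, and module_is_importable("classifier") can be logged to stderr if the import still fails.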

Also, I recommend that anyone with a problem similar to mine read this great post on using the distributed cache on AWS: https://forums.aws.amazon.com/message.jspa?messageID=152538
