python - Opening files on HDFS from a Hadoop mapreduce job

Question

Normally, I can open a file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

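This opens the two relevant text files in the WordLists folder and stores the lines of each one in the dictionary as a set, under the key 'positive' or 'negative'.

However, I don't think this will work when I run a mapreduce job in Hadoop. I am running my program like this: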

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

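I tried changing the code to:

with open('/mapreduce/WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

where mapreduce is a folder on HDFS and WordLists is a subfolder containing the negative words. But my program does not find the file. Is what I'm doing possible, and if so, what is the correct way to load a file that is on HDFS?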

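Edit

I have now tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This seems to do something, but now I get this kind of output: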

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

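Then the job fails. So it's still not right. Any ideas?

Edit 2:

After re-reading the API documentation, I noticed that I can use the -files option on the command line to specify files. The documentation states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.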

-files hdfs://host:fs_port/user/testfile.txt

So I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

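As I understand the documentation, this creates the symlinks, so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

However, this still doesn't work. Any help would be greatly appreciated, as I can't make progress until I solve this.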

Edit 3:

I can use this option:

-file ~/Twitter/SentimentWordLists/positive_words.txt

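together with the rest of my command to run the Hadoop job. It finds the file on my local system rather than on HDFS. It doesn't raise any errors, so the file is being accepted somewhere. However, I don't know how to access the file.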

score 2 · Accepted Answer

The solution, after a lot of comments :)

To read a data file in Python: ship it with -file and add the following to your script:

import sys

Sometimes you also need to add this after the import:

sys.path.append('.')

(Related to @DrDee's comment on "Hadoop Streaming - Unable to find file error")
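For illustration, here is a minimal sketch of what reading the shipped files in the mapper might look like. It assumes both word lists were shipped with -file, so that they appear under their base names (positive_words.txt, negative_words.txt) in the task's current working directory. The helper names load_word_set and classify, and the tab-separated output format, are made up for this example and are not from the original post.

```python
import os
import sys


def load_word_set(filename, directory="."):
    """Read one word per line from a file in the task's working directory."""
    path = os.path.join(directory, filename)
    with open(path, "r") as f:
        # Skip blank lines so the empty string never ends up in the set.
        return {line.strip() for line in f if line.strip()}


def classify(text, word_sets):
    """Count how many whitespace-separated tokens of text fall in each word set."""
    tokens = text.lower().split()
    return {
        label: sum(1 for token in tokens if token in words)
        for label, words in word_sets.items()
    }


def main():
    # Assumption: both files were shipped with -file and sit next to the script.
    word_sets = {
        "positive": load_word_set("positive_words.txt"),
        "negative": load_word_set("negative_words.txt"),
    }
    for line in sys.stdin:
        counts = classify(line, word_sets)
        # Hypothetical output format: input line, positive count, negative count.
        print("%s\t%d\t%d" % (line.strip(), counts["positive"], counts["negative"]))


if __name__ == "__main__":
    main()
```

The word lists are opened by relative name only, with no local directory and no hdfs:// URL, because the script runs inside the task's working directory rather than on the machine that submitted the job.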

score 0 · Accepted Answer

When working with HDFS programmatically, you should look into FileSystem, FileStatus, and Path. These are the Hadoop API classes that let you access HDFS from within your program.
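The classes named above belong to Hadoop's Java API, so a Python streaming script cannot call them directly. One alternative, which is a different technique from the one this answer describes, is to shell out to the `hadoop fs -cat` command. The sketch below assumes a `hadoop` executable is reachable from wherever the script runs; the function name read_hdfs_lines and its hadoop_bin and run parameters are hypothetical.

```python
import subprocess


def read_hdfs_lines(hdfs_path, hadoop_bin="hadoop", run=subprocess.check_output):
    """Return the stripped, non-empty lines of an HDFS file via `hadoop fs -cat`.

    `run` is injectable so the function can be exercised without a cluster.
    """
    output = run([hadoop_bin, "fs", "-cat", hdfs_path])
    return [
        line.strip()
        for line in output.decode("utf-8").splitlines()
        if line.strip()
    ]
```

With the layout from the question, a call might look like read_hdfs_lines('/mapreduce/WordLists/negative_words.txt', hadoop_bin='./hadoop/bin/hadoop'). Starting a JVM from every task is slow, so shipping small lookup files with -file or -files is usually the simpler route.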
