All,
I'm building an interface to process some large volumes of data and generate ARFF files for machine learning. I can currently collect the features, but I can't associate them with the files they were derived from. I'm currently using Dumbo:
    def mapper(key, value):
        # do stuff to generate features
Is there any convenient way to determine the name of the file being read and pass it, along with the file's contents, into the mapper function?
Thanks again. -Sam
As described here, you can use the -addpath yes option:
-addpath yes (replaces each input key with a tuple consisting of the path of the corresponding input file and the original key)
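As a minimal sketch, assuming -addpath yes is set, each key the mapper receives becomes a (path, original_key) tuple, so you can unpack it to tag every feature record with its source file (the token count stands in for your real feature extraction):

    def mapper(key, value):
        # With -addpath yes, Dumbo replaces each input key with a
        # (input_file_path, original_key) tuple
        path, original_key = key
        # do stuff to generate features from value, e.g. a token count
        features = len(value.split())
        yield path, features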
If you're able to access the job configuration properties, then the mapreduce.job.input.file property should contain the name of the file currently being processed.
I'm not sure how you get at these properties in Dumbo/mrjob, though. The docs specify that periods in the conf names are replaced with underscores, and looking through the source of PipeMapRed.java, it appears that every job conf property is set as an environment variable - so try accessing an environment variable named mapreduce_job_input_file.
http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#Configured+Parameters
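A sketch under that assumption (whether this particular variable is actually exported in a Dumbo streaming job is unverified; the token count stands in for your real feature extraction):

    import os

    def mapper(key, value):
        # Hadoop streaming reportedly exports job conf properties as
        # environment variables, with periods replaced by underscores
        input_file = os.environ.get('mapreduce_job_input_file', 'unknown')
        # do stuff to generate features from value, e.g. a token count
        features = len(value.split())
        yield input_file, features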