All,
I'm building an interface to process some large volumes of data and generate ARFF files for machine learning. I can currently collect the features, but I can't associate them with the files they were derived from. I'm currently using Dumbo:
    def mapper(key, value):
        # do stuff to generate features
Is there any convenient way to determine the name of the file being read and pass it, along with the file's contents, into the mapper function?
Thanks again. -Sam
As described here, you can use the -addpath yes option:
-addpath yes (replaces each input key with a tuple consisting of the path of the corresponding input file and the original key)
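As a minimal sketch, assuming -addpath yes is set, each key the mapper receives becomes a (path, original_key) tuple, so you can unpack it to tag every feature record with its source file (the token count stands in for your real feature extraction):

    def mapper(key, value):
        # With -addpath yes, Dumbo replaces each input key with a
        # (input_file_path, original_key) tuple
        path, original_key = key
        # do stuff to generate features from value, e.g. a token count
        features = len(value.split())
        yield path, features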
If you're able to access the job configuration properties, then the mapreduce.job.input.file property should contain the name of the file currently being processed.
I'm not sure how you get at these properties in Dumbo/mrjob, though. The docs specify that periods in the conf names are replaced with underscores, and looking through the source of PipeMapRed.java, it appears that every job conf property is set as an environment variable - so try accessing an environment variable named mapreduce_job_input_file.
http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#Configured+Parameters
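A sketch under that assumption (whether this particular variable is actually exported in a Dumbo streaming job is unverified; the token count stands in for your real feature extraction):

    import os

    def mapper(key, value):
        # Hadoop streaming reportedly exports job conf properties as
        # environment variables, with periods replaced by underscores
        input_file = os.environ.get('mapreduce_job_input_file', 'unknown')
        # do stuff to generate features from value, e.g. a token count
        features = len(value.split())
        yield input_file, features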