This is part of my code. Below is the script that copies the files into HDFS and launches the mr-job. I upload this script to the Hadoop node during the Maven integration-test phase using the Ant scp and ssh targets.
#!/bin/sh
# dummy script for running the mr-job

# clean up the previous output, metadata and aggregated yarn logs in HDFS
hadoop fs -rm -r /HttpSample/output
hadoop fs -rm -r /HttpSample/metadata.csv
hadoop fs -rm -r /var/log/hadoop-yarn/apps/cloudera/logs

# copy the metadata file and the dependency jars into HDFS
#hadoop hadoop dfs -put /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/opencsv.jar /HttpSample/opencsv.jar
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/gson.jar /HttpSample/gson.jar

# run the mr-job, passing metadata.csv via -files and the jars via -libjars
cd /home/cloudera/uploaded_jars
#hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -libjars gson.jar -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar, hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar,hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
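For context on how that command line is parsed: -files and -libjars are generic options, consumed by GenericOptionsParser, which ToolRunner applies before the driver's run() is called, so the main class has to go through ToolRunner/Tool for them to take effect. A minimal driver sketch of that shape (the class names and paths come from the command line above; the exact job wiring is an assumption, not my real Main):

package ru.megalabs.mapreduce.scoringcounter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a Tool-based driver: ToolRunner runs GenericOptionsParser, which is
// what consumes -files and -libjars; only the input and output paths remain in args.
public class Main extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries the distributed-cache entries added by -files/-libjars.
        Job job = Job.getInstance(getConf(), "scoring-job");
        job.setJarByClass(Main.class);
        job.setMapperClass(ScoringCounterMapper.class);
        job.setOutputKeyClass(GetReq.class);        // key/value types taken from the mapper signature
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // /HttpSample/raw_traffic.json
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // /HttpSample/output/scoring_result
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Main(), args));
    }
}

If the driver parses its arguments itself instead of going through ToolRunner, -files and -libjars are not interpreted at all and simply arrive as ordinary strings in args.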
The code inside the Mapper (imports shown for completeness; GetReq, RegexMetadata and MetadataCsvReader are my own classes):
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoringCounterMapper extends Mapper<LongWritable, Text, GetReq, IntWritable> {
    private static final Log LOG = LogFactory.getLog(ScoringCounterMapper.class);
    private static final String METADATA_CSV = "metadata.csv";

    private List<RegexMetadata> regexMetadatas = null;
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // bla-bla-bla
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }
}
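For what it's worth, with -files the file should be localized into the task's working directory under its own name, and its URI should show up in context.getCacheFiles() on the Hadoop 2.x API. A sketch of a setup() variant with extra logging that could be dropped into the mapper above to check what the task actually sees (it needs import java.net.URI; the reader call itself is unchanged):

    // Debugging variant of setup() for the mapper above (add "import java.net.URI;").
    // It only adds logging: the task working directory and the distributed-cache
    // URIs registered by -files / -libjars (Hadoop 2.x API).
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        LOG.info("Working dir: " + new File(".").getAbsolutePath());
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null) {
            for (URI cacheFile : cacheFiles) {
                LOG.info("Cache file: " + cacheFile);
            }
        }
        // -files localizes metadata.csv under its own name in the working directory,
        // so the plain relative name should resolve to the local copy.
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }

The symlink name can also be forced explicitly with a fragment, e.g. -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv#metadata.csv.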
Note that:
1. I do upload my metadata file to the node.
2. I do put it into HDFS.
3. I do pass the file path via the -files argument.
4. I do point that argument at the file inside HDFS (hdfs://0.0.0.0:8020/HttpSample/metadata.csv).