hadoop - 为什么我接下来不能处理我的 hadoop 程序？

Question

大家！我在eclipse中有一个关于hadoop的程序，源码是：

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while(itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for(IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] oargs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if(oargs.length != 2) {
            System.err.println("Usage: word count <in> <out>");
        }
        System.out.println("input:  "+oargs[0]);
        System.out.println("output: "+oargs[1]);
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(oargs[0]));
        FileOutputFormat.setOutputPath(job, new Path(oargs[1]));
        System.out.println("==============================");
        System.out.println("start ...");
        boolean flag = job.waitForCompletion(true);
            System.out.println(flag);
        System.out.println("end ...");
        System.out.println("==============================");
    }
}

结果是，请看日志：

rory@0303 /cygdrive/f/develop/hadoop/hadoop-1.0.3
$ ./bin/hadoop jar ./jar/wordcount.jar /tmp/input /tmp/output
input:  /tmp/input
output: /tmp/output
==============================
start ...
12/07/25 14:59:17 INFO input.FileInputFormat: Total input paths to process : 2
12/07/25 14:59:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/07/25 14:59:17 WARN snappy.LoadSnappy: Snappy native library not loaded
12/07/25 14:59:17 INFO mapred.JobClient: Running job: job_201207251447_0001
12/07/25 14:59:18 INFO mapred.JobClient:  map 0% reduce 0%

日志不会继续下去，永远停在那里。为什么？

我正在本地模式下运行代码，通过 windows xp 系统中的 cygwin 软件。

score 0 · Accepted Answer

我想如果你问为什么你从来没有看到end ====================println 部分，然后检查你的代码：

System.exit(job.waitForCompletion(true)?0:1);
System.out.println("end ...");
System.out.println("==============================");

您使用job.waitForCompletion(true)a 包装调用System.exit，因此 JVM 将在最后两个 System.out 可以执行之前终止。

编辑

这里的日志附加器/记录器消息是任何其他异常可能被吞没的线索。您应该修改代码的签名以利用 ToolRunner 实用程序：

public class WordCount {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new WordCount(), args);  
  }

  public int run(String args[]) {
    if(args.length != 2) {
        System.err.println("Usage: word count <in> <out>");
    }
    System.out.println("input:  "+args[0]);
    System.out.println("output: "+args[1]);
    Job job = new Job(getConf(), "word count");
    Configuration conf = job.getConf();

    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.out.println("==============================");
    System.out.println("start ...");
    int result = job.waitForCompletion(true) ? 0 : 1;
    System.out.println("end ...");
    System.out.println("==============================");

    return results
  }
}

并且您应该使用 $HADOOP_HOME/bin/hadoop 脚本将您的作业提交到集群（如下所示，您需要替换 jar 的名称和 WordCount 类的完全限定名称）：

#> hadoop jar wordcount.jar WordCount input output

score 0 · Accepted Answer

@Rory，正如托马斯所问的那样，您能否更具体地说明“下一步做”？这是您在屏幕上看到的整个堆栈跟踪吗？你的意思是你编译过一次然后得到结果就不能再运行了？您是否在 Eclipse IDE 上为您的程序指定了正确的输入参数，即输入和输出目录？

如果您的意思是第二次无法再次运行该程序，可能是您没有指定不同的输出目录。但我想在看到堆栈跟踪后情况并非如此。

hadoop - 为什么我接下来不能处理我的 hadoop 程序？

2 回答 2

Related

Reference