我用 Java 编写了一个 mapreduce 程序,我可以将它提交到以分布式模式运行的远程集群。目前,我使用以下步骤提交作业:
- 将 mapreuce 作业导出为 jar(例如
myMRjob.jar
) - 使用以下 shell 命令将作业提交到远程集群:
hadoop jar myMRjob.jar
当我尝试运行程序时,我想直接从 Eclipse 提交作业。我怎样才能做到这一点?
我目前正在使用 CDH3,我的 conf 的精简版是:
conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
CountRows.MyMapper.class, ImmutableBytesWritable.class,
ImmutableBytesWritable.class, job);
// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);
// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.submit();
当我直接从 Eclipse 运行它时,作业已启动,但 Hadoop 找不到映射器/减速器。我收到以下错误:
12/06/27 23:23:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/27 23:23:37 INFO mapred.JobClient: Task Id : attempt_201206152147_0645_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mypkg.mapreduce.CountRows$MyMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
...
有谁知道如何克服这些错误?如果我能解决这个问题,我可以将更多的 MR 作业集成到我的脚本中,这太棒了!