
I am running Apache Hadoop and using the grep example that ships with the installation. I am wondering why the map/reduce percentages appear to run twice. I thought they only needed to run once, and this makes me doubt my understanding of MapReduce. I searched around (http://grokbase.com/t/gg/mongodb-user/125ay1eazq/map-reduce-percentage-seems-running-twice), but it does not really explain it, and that link is about MongoDB anyway.

hduser@ubse1:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar grep /user/hduser/grep /user/hduser/grep-output4 ".*woe is me.*"

I am running it on a Project Gutenberg .txt file. The output file is correct.

Here is the output of running the command, in case it is needed:

12/08/06 06:56:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/06 06:56:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/06 06:56:57 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011
12/08/06 06:56:59 INFO mapred.JobClient:  map 0% reduce 0%
12/08/06 06:57:18 INFO mapred.JobClient:  map 100% reduce 0%
12/08/06 06:57:30 INFO mapred.JobClient:  map 100% reduce 100%
12/08/06 06:57:35 INFO mapred.JobClient: Job complete: job_201208030925_0011
12/08/06 06:57:35 INFO mapred.JobClient: Counters: 30
12/08/06 06:57:35 INFO mapred.JobClient:   Job Counters 
12/08/06 06:57:35 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/06 06:57:35 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=31034
12/08/06 06:57:35 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient:     Rack-local map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient:     Launched map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11233
12/08/06 06:57:35 INFO mapred.JobClient:   File Input Format Counters 
12/08/06 06:57:35 INFO mapred.JobClient:     Bytes Read=5592666
12/08/06 06:57:35 INFO mapred.JobClient:   File Output Format Counters 
12/08/06 06:57:35 INFO mapred.JobClient:     Bytes Written=391
12/08/06 06:57:35 INFO mapred.JobClient:   FileSystemCounters
12/08/06 06:57:35 INFO mapred.JobClient:     FILE_BYTES_READ=281
12/08/06 06:57:35 INFO mapred.JobClient:     HDFS_BYTES_READ=5592862
12/08/06 06:57:35 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=65331
12/08/06 06:57:35 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=391
12/08/06 06:57:35 INFO mapred.JobClient:   Map-Reduce Framework
12/08/06 06:57:35 INFO mapred.JobClient:     Map output materialized bytes=287
12/08/06 06:57:35 INFO mapred.JobClient:     Map input records=124796
12/08/06 06:57:35 INFO mapred.JobClient:     Reduce shuffle bytes=287
12/08/06 06:57:35 INFO mapred.JobClient:     Spilled Records=10
12/08/06 06:57:35 INFO mapred.JobClient:     Map output bytes=265
12/08/06 06:57:35 INFO mapred.JobClient:     Total committed heap usage (bytes)=336404480
12/08/06 06:57:35 INFO mapred.JobClient:     CPU time spent (ms)=7040
12/08/06 06:57:35 INFO mapred.JobClient:     Map input bytes=5590193
12/08/06 06:57:35 INFO mapred.JobClient:     SPLIT_RAW_BYTES=196
12/08/06 06:57:35 INFO mapred.JobClient:     Combine input records=5
12/08/06 06:57:35 INFO mapred.JobClient:     Reduce input records=5
12/08/06 06:57:35 INFO mapred.JobClient:     Reduce input groups=5
12/08/06 06:57:35 INFO mapred.JobClient:     Combine output records=5
12/08/06 06:57:35 INFO mapred.JobClient:     Physical memory (bytes) snapshot=464568320
12/08/06 06:57:35 INFO mapred.JobClient:     Reduce output records=5
12/08/06 06:57:35 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1539559424
12/08/06 06:57:35 INFO mapred.JobClient:     Map output records=5
12/08/06 06:57:35 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012
12/08/06 06:57:36 INFO mapred.JobClient:  map 0% reduce 0%
12/08/06 06:57:50 INFO mapred.JobClient:  map 100% reduce 0%
12/08/06 06:58:05 INFO mapred.JobClient:  map 100% reduce 100%
12/08/06 06:58:10 INFO mapred.JobClient: Job complete: job_201208030925_0012
12/08/06 06:58:10 INFO mapred.JobClient: Counters: 30
12/08/06 06:58:10 INFO mapred.JobClient:   Job Counters 
12/08/06 06:58:10 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/06 06:58:10 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15432
12/08/06 06:58:10 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient:     Rack-local map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient:     Launched map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14264
12/08/06 06:58:10 INFO mapred.JobClient:   File Input Format Counters 
12/08/06 06:58:10 INFO mapred.JobClient:     Bytes Read=391
12/08/06 06:58:10 INFO mapred.JobClient:   File Output Format Counters 
12/08/06 06:58:10 INFO mapred.JobClient:     Bytes Written=235
12/08/06 06:58:10 INFO mapred.JobClient:   FileSystemCounters
12/08/06 06:58:10 INFO mapred.JobClient:     FILE_BYTES_READ=281
12/08/06 06:58:10 INFO mapred.JobClient:     HDFS_BYTES_READ=505
12/08/06 06:58:10 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=42985
12/08/06 06:58:10 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=235
12/08/06 06:58:10 INFO mapred.JobClient:   Map-Reduce Framework
12/08/06 06:58:10 INFO mapred.JobClient:     Map output materialized bytes=281
12/08/06 06:58:10 INFO mapred.JobClient:     Map input records=5
12/08/06 06:58:10 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/08/06 06:58:10 INFO mapred.JobClient:     Spilled Records=10

EDIT: Grep's driver class, Grep.java:

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements.  See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership.  The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License.  You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
 private Grep() {} // singleton

 public int run(String[] args) throws Exception {
 if (args.length < 3) {
 System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
 ToolRunner.printGenericCommandUsage(System.out);
 return -1;
 }

 Path tempDir =
 new Path("grep-temp-"+
 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

 JobConf grepJob = new JobConf(getConf(), Grep.class);

 try {

 grepJob.setJobName("grep-search");
 FileInputFormat.setInputPaths(grepJob, args[0]);

 grepJob.setMapperClass(RegexMapper.class);
 grepJob.set("mapred.mapper.regex", args[2]);
 if (args.length == 4)
 grepJob.set("mapred.mapper.regex.group", args[3]);

 grepJob.setCombinerClass(LongSumReducer.class);
 grepJob.setReducerClass(LongSumReducer.class);

 FileOutputFormat.setOutputPath(grepJob, tempDir);
 grepJob.setOutputFormat(SequenceFileOutputFormat.class);
 grepJob.setOutputKeyClass(Text.class);
 grepJob.setOutputValueClass(LongWritable.class);

 // This first runJob call blocks until the grep-search job finishes and
 // prints the first "map X% reduce Y%" progress sequence.
 JobClient.runJob(grepJob);

 JobConf sortJob = new JobConf(getConf(), Grep.class);
 sortJob.setJobName("grep-sort");

 FileInputFormat.setInputPaths(sortJob, tempDir);
 sortJob.setInputFormat(SequenceFileInputFormat.class);

 sortJob.setMapperClass(InverseMapper.class);

 sortJob.setNumReduceTasks(1); // write a single file
 FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
 sortJob.setOutputKeyComparatorClass // sort by decreasing freq
 (LongWritable.DecreasingComparator.class);

 // The second runJob call launches the grep-sort job and prints the second
 // progress sequence seen in the console output.
 JobClient.runJob(sortJob);
 }
 finally {
 FileSystem.get(grepJob).delete(tempDir, true);
 }
 return 0;
 }

 public static void main(String[] args) throws Exception {
 int res = ToolRunner.run(new Configuration(), new Grep(), args);
 System.exit(res);
 }

}

1 Answer


The output contains statistics for two jobs: job_201208030925_0011 and job_201208030925_0012. The progress percentages belong to these two jobs, which is why there are two sets of map/reduce percentages. (In the driver you posted, these are the grep-search and grep-sort jobs, each launched by its own JobClient.runJob call.)
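
For illustration, here is a minimal sketch (not the actual example code; the class name, job names, and paths are hypothetical) of a driver that chains two jobs the way Grep.java does. Each blocking JobClient.runJob call reports its own map/reduce progress, so a driver with two calls prints two progress runs:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver: two chained jobs, the second reads the first's output.
public class TwoJobDriver {
  public static void main(String[] args) throws Exception {
    Path in   = new Path(args[0]);
    Path temp = new Path(args[1] + "-temp"); // intermediate dir, like grep-temp-*
    Path out  = new Path(args[1]);

    JobConf first = new JobConf(TwoJobDriver.class);
    first.setJobName("first-pass");                  // e.g. "grep-search"
    FileInputFormat.setInputPaths(first, in);
    FileOutputFormat.setOutputPath(first, temp);
    JobClient.runJob(first);   // blocks; prints the first "map X% reduce Y%" run

    JobConf second = new JobConf(TwoJobDriver.class);
    second.setJobName("second-pass");                // e.g. "grep-sort"
    FileInputFormat.setInputPaths(second, temp);
    FileOutputFormat.setOutputPath(second, out);
    JobClient.runJob(second);  // blocks; prints the second "map X% reduce Y%" run
  }
}

In the real Grep driver, the first job writes its matches to a temporary SequenceFile directory, the second job reads that directory, sorts by decreasing count, and writes the final output, and the temporary directory is then deleted.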

Answered 2012-08-06T21:30:05.107