hadoop - How do I determine the right number of mappers in Hadoop?

Question

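I give my Hadoop program an input file of size 4 MB (it has 100k records). Since each HDFS block is 64 MB and the file fits in a single block, I set the number of mappers to 1. However, when I increase the number of mappers (say, to 24), the running time gets much better. I have no idea why this happens, since the whole file can be read by only one mapper.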

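A brief description of the algorithm: the clusters are read from the DistributedCache in the configure function and stored in a global variable called clusters. The mapper reads each chunk line by line and finds the cluster each line belongs to. Here is some of the code: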

public void configure(JobConf job) {
    // Retrieve the clusters from the DistributedCache.
    try {
        Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));

        String line;
        while ((line = reader.readLine()) != null) {
            // Construct the cluster represented by `line` and add it to a global variable called `clusters`.
        }

        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

And the mapper:

public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
    // Assign each record to one of the existing clusters in `clusters`.
    String record = value.toString();
    EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
    outputValue.addRecord(record);
    int eqID = MondrianTree.findCluster(record, clusters);
    IntWritable outputKey = new IntWritable(eqID);
    output.collect(outputKey, outputValue);
}

I have input files of different sizes (from 4 MB up to 4 GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.

score 0 · Accepted Answer

"since the whole file can be read by only one mapper."

That's not the case. A few points to keep in mind...

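That single block is replicated 3 times (by default), which means three separate nodes have access to the same block without having to go over the network.
There's no reason a single block can't be copied to multiple machines, each of which then seeks to the split it has been assigned.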
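
To make the point about splits concrete, here is a minimal sketch of how the old `org.apache.hadoop.mapred` `FileInputFormat` (the API the question's code uses) sizes its splits, assuming a single input file and assuming the mapper count was requested with something like `JobConf.setNumMapTasks`. The class and method names below are my own and not part of Hadoop; only the formula `max(minSize, min(goalSize, blockSize))` and the 1.1 slop factor are taken from Hadoop's implementation.

```java
// Hypothetical, self-contained sketch; it does not call any Hadoop classes.
final class OldApiSplitSketch {
    // FileInputFormat lets the last split be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    /** Split size in the old API: max(minSize, min(goalSize, blockSize)). */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Counts the splits for one file, given the requested number of map tasks. */
    static int countSplits(long fileSize, int requestedMaps, long minSize, long blockSize) {
        // goalSize is the file size divided by the requested map count (a hint, not a guarantee).
        long goalSize = fileSize / (requestedMaps == 0 ? 1 : requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly slightly larger, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 4L * 1024 * 1024;   // the 4 MB file from the question
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block
        System.out.println("1 requested map   -> " + countSplits(fileSize, 1, 1, blockSize) + " split(s)");
        System.out.println("24 requested maps -> " + countSplits(fileSize, 24, 1, blockSize) + " split(s)");
    }
}
```

Under these assumptions, one requested map yields one split and 24 requested maps yield 24 splits of roughly 170 KB each, so 24 mappers really do share the one block, each seeking to its own byte range.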

score 0 · Accepted Answer

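You need to tune "mapred.max.split.size". Give an appropriate size in bytes as the value. The MR framework will compute the correct number of mappers based on this and the block size.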
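
As a rough illustration of that calculation, the sketch below applies the formula used by the new-API (`org.apache.hadoop.mapreduce`) `FileInputFormat`, `max(minSize, min(maxSize, blockSize))`, to pick a max split size for a target mapper count. The class name, the helper methods, and the target of 116 mappers (58 nodes × 2 cores, one mapper per core) are my own assumptions for illustration. As far as I can tell, the old `mapred` API used in the question sizes splits from the requested map count rather than from this property, so check which API your job uses.

```java
// Hypothetical, self-contained sketch; it does not call any Hadoop classes.
final class MaxSplitSizeSketch {
    // FileInputFormat lets the last split be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    /** Split size in the new API: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Smallest max split size (bytes) that keeps the file within the desired mapper count. */
    static long maxSplitSizeFor(long fileSize, int desiredMappers) {
        return (fileSize + desiredMappers - 1) / desiredMappers; // ceiling division
    }

    /** Counts the splits produced for one file at a given split size. */
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 4L * 1024 * 1024 * 1024; // the 4 GB input from the question
        long blockSize = 64L * 1024 * 1024;      // 64 MB HDFS block
        int desiredMappers = 58 * 2;             // assumption: one mapper per core

        long maxSize = maxSplitSizeFor(fileSize, desiredMappers);
        long splitSize = computeSplitSize(blockSize, 1L, maxSize);
        System.out.println("max split size = " + maxSize + " bytes");
        System.out.println("mappers        = " + countSplits(fileSize, splitSize));
    }
}
```

Note that a max split size larger than the block size has no effect in this formula: the block size caps the split, so the 4 GB file would fall back to 64 splits of 64 MB each.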
