
I need to process one billion records periodically. The number of unique keys can be in the range of 10 million. Each value is a string of up to 200K characters.

Here are my questions:

  1. The key space is very large (10 million unique keys). Would Hadoop be able to handle such a large key space? If there is one reducer per key, there will be millions of reducers.

  2. I want to update the DB in the reducer itself. In the reducer I will merge the values (call this the current value), read the existing value from the DB (call this the existing value), merge the current and existing values, and update the DB. Is this the right strategy?

  3. How many reducers can run simultaneously per box? Is this configurable? If only a single reducer runs per box at a time, it will be a problem, as I won't be able to update the state for the keys in the DB fast enough.

  4. I want the job to complete in 2-3 hours. How many boxes would I need? (I can spare at most 50 boxes: 64 GB RAM, 8-core machines.)

Thanks


1 Answer


Answering your questions:

a. You have a wrong notion about how keys are distributed among reducers. The number of reducers is not equal to the number of unique mapper output keys. The concept is: all the values associated with a given key from the mappers go to a single reducer. This by no means implies that a reducer gets only one key.

For example, consider the following mapper output:

Mapper(k1,v1), Mapper(k1,v2), Mapper(k1,v3)
Mapper(k2,w1), Mapper(k2,w2)
Mapper(k3,u1), Mapper(k3,u2), Mapper(k3,u3), Mapper(k3,u4)

So the values associated with k1 (v1, v2, v3) will go to a single reducer, say R1, and they will never be split across multiple reducers. But that does not mean R1 will have only the one key k1 to process; it may also get the values for k2 and k3. However, for any key a reducer receives, all the values associated with that key will come to that same reducer. Hope this clears your doubt.
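To make this concrete, here is a minimal sketch (not part of the answer itself, just an illustration of the old mapred API) of what default hash partitioning boils down to: the reducer index depends only on the key's hash and the total number of reducers, so with 10 million keys and a few hundred reducers, tens of thousands of distinct keys land on each reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Simplified stand-in for the default HashPartitioner: every record with the
// same key is sent to the same reduce partition, but each partition still
// serves many different keys.
public class SimpleHashPartitioner implements Partitioner<Text, Text> {

    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Clear the sign bit, then take the remainder modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}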

b. Which database are you using? To reduce the number of DB calls or update statements, you can issue the query at the end of reduce(), after you have finished looping over the values associated with a particular key.

For example:

// Imports needed in the enclosing driver class:
// import java.io.IOException;
// import java.util.Iterator;
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapred.MapReduceBase;
// import org.apache.hadoop.mapred.OutputCollector;
// import org.apache.hadoop.mapred.Reducer;
// import org.apache.hadoop.mapred.Reporter;

public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {

        // Loop over all values that arrived for this key and merge them in memory.
        while (values.hasNext()) {
            Text value = values.next();
            // ... merge 'value' into the running result for this key
        }

        // Have your DB read/merge/update query here, once per key instead of
        // once per value, to cut down on DB calls.
    }
}
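Since you haven't said which DB you use, here is only a hedged JDBC-style sketch of the read-merge-update step from your second question; the table and column names (kv_store, k, v), the MySQL-style upsert, and the trivial merge are all placeholders, not something taken from this answer:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbMergeHelper {

    // Read the existing value for 'key', merge it with the value reduced in this
    // run, and write the result back with a single upsert per key.
    public static void upsertMergedValue(Connection conn, String key, String currentValue)
            throws SQLException {
        String existingValue = null;
        try (PreparedStatement select = conn.prepareStatement("SELECT v FROM kv_store WHERE k = ?")) {
            select.setString(1, key);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    existingValue = rs.getString(1);
                }
            }
        }

        String merged = (existingValue == null) ? currentValue : merge(existingValue, currentValue);

        // MySQL-style upsert; adjust the statement for whichever DB you actually use.
        try (PreparedStatement upsert = conn.prepareStatement(
                "INSERT INTO kv_store (k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)")) {
            upsert.setString(1, key);
            upsert.setString(2, merged);
            upsert.executeUpdate();
        }
    }

    // Placeholder merge logic; replace with your application-specific merge.
    private static String merge(String existing, String current) {
        return existing + current;
    }
}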

c. Yes, the number of reducers is configurable. If you want to set it per job, you can add a line in your job code's run() method to set the number of reducers:

jobConf.setNumReduceTasks(numReducers);   // equivalent to jobConf.setInt("mapred.reduce.tasks", numReducers)
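For context, a minimal sketch of where that call sits in an old-API driver; the class name, the reducer count of 400, and the paths are illustrative only, and the mapper/input-format setup is omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MergeJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(MergeJobDriver.class);
        jobConf.setJobName("merge-values");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        // jobConf.setMapperClass(...) and input format setup omitted for brevity
        jobConf.setReducerClass(ReduceJob.class);   // assumes the ReduceJob from (b) is nested in this class

        // Same effect as jobConf.setInt("mapred.reduce.tasks", 400):
        jobConf.setNumReduceTasks(400);             // illustrative value

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}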

If you want to set it on a per-machine basis, i.e. how many reducers each machine in the cluster should run, you need to change the cluster's Hadoop configuration:

mapred.tasktracker.{map|reduce}.tasks.maximum - The maximum number of map/reduce tasks, respectively, that are run simultaneously on a given TaskTracker. The default is 2 (2 maps and 2 reduces), but the right value depends on your hardware.
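For reference, on each TaskTracker node this looks roughly like the following entries in mapred-site.xml (the value 6 is only an illustration for an 8-core box, not a recommendation):

<!-- mapred-site.xml on every TaskTracker node -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <!-- illustrative: how many reduce tasks this node may run at once -->
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>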

More details here: http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons

d. If your data files are not gzipped (Hadoop cannot split gzipped files into InputSplits), then, by your numbers, you have up to about 200 * 1024 bytes * 1 billion records = 204,800 GB, i.e. roughly 204.8 TB of data. So if you want it finished in 2-3 hours, you had better keep all 50 boxes and, if the reducers' memory footprint is low, increase the number of reducers per machine as described in the previous answer. Also, increasing the InputSplit size to around 128 MB may help.
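As a rough sketch of that last point (again an illustration, not something from the original answer), the minimum split size can be raised on the job so each map task reads more data and fewer tasks get scheduled; in the old API that looks something like:

// Ask for ~128 MB input splits (assumes the old mapred.min.split.size property
// and a splittable input format, i.e. the files are not gzipped).
jobConf.setLong("mapred.min.split.size", 128L * 1024 * 1024);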

Thanks and regards,
Kartikeya Sinha

answered 2013-04-23T19:17:39