
I need to process one billion records periodically. The number of unique keys can be in the range of 10 million. Each value is a string of up to 200K characters.

Here are my questions:

  1. The key space is very large (10 million unique keys). Would Hadoop be able to handle such a large key space? If there is one reducer per key, there will be millions of reducers.

  2. I want to update the DB in the reducer itself. In the reducer I will merge the values (call this the current value), read the existing value from the DB (call this the existing value), merge the current and existing values, and update the DB. Is this the right strategy?

  3. How many reducers can run simultaneously per box? Is this configurable? If only a single reducer runs per box at a time, it will be a problem, as I won't be able to update the state for the keys in the DB fast enough.

  4. I want the job to complete in 2-3 hours. How many boxes would I need? (I can spare at most 50 boxes: 64 GB RAM, 8-core machines.)

Thanks


1 Answer


Answering your questions:

a. You have a wrong notion about how keys are distributed among reducers. The number of reducers is not equal to the number of unique mapper output keys. The concept is: all the values associated with a given key from the mappers go to a single reducer. This by no means implies that a reducer gets only one key.

For example, consider the following mapper output:

Mapper(k1,v1), Mapper(k1,v2), Mapper(k1,v3)
Mapper(k2,w1), Mapper(k2,w2)
Mapper(k3,u1), Mapper(k3,u2), Mapper(k3,u3), Mapper(k3,u4)

So the values associated with k1 (v1, v2, v3) will go to a single reducer, say R1, and they will never be split across multiple reducers. But that does not mean R1 will have only the one key k1 to process; it may also get the values for k2 and k3. However, for any key a reducer receives, all the values associated with that key will come to that same reducer. Hope this clears your doubt.
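To make this concrete, here is a minimal sketch (not part of the answer itself, just an illustration of the old mapred API) of what default hash partitioning boils down to: the reducer index depends only on the key's hash and the total number of reducers, so with 10 million keys and a few hundred reducers, tens of thousands of distinct keys land on each reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Simplified stand-in for the default HashPartitioner: every record with the
// same key is sent to the same reduce partition, but each partition still
// serves many different keys.
public class SimpleHashPartitioner implements Partitioner<Text, Text> {

    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Clear the sign bit, then take the remainder modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}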

b. Which database are you using? To reduce the number of DB calls or update statements, you can issue the query at the end of reduce(), after you have finished looping over the values associated with a particular key.

For example:

// Imports needed in the enclosing driver class:
// import java.io.IOException;
// import java.util.Iterator;
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapred.MapReduceBase;
// import org.apache.hadoop.mapred.OutputCollector;
// import org.apache.hadoop.mapred.Reducer;
// import org.apache.hadoop.mapred.Reporter;

public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {

        // Loop over all values that arrived for this key and merge them in memory.
        while (values.hasNext()) {
            Text value = values.next();
            // ... merge 'value' into the running result for this key
        }

        // Have your DB read/merge/update query here, once per key instead of
        // once per value, to cut down on DB calls.
    }
}
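Since you haven't said which DB you use, here is only a hedged JDBC-style sketch of the read-merge-update step from your second question; the table and column names (kv_store, k, v), the MySQL-style upsert, and the trivial merge are all placeholders, not something taken from this answer:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbMergeHelper {

    // Read the existing value for 'key', merge it with the value reduced in this
    // run, and write the result back with a single upsert per key.
    public static void upsertMergedValue(Connection conn, String key, String currentValue)
            throws SQLException {
        String existingValue = null;
        try (PreparedStatement select = conn.prepareStatement("SELECT v FROM kv_store WHERE k = ?")) {
            select.setString(1, key);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    existingValue = rs.getString(1);
                }
            }
        }

        String merged = (existingValue == null) ? currentValue : merge(existingValue, currentValue);

        // MySQL-style upsert; adjust the statement for whichever DB you actually use.
        try (PreparedStatement upsert = conn.prepareStatement(
                "INSERT INTO kv_store (k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)")) {
            upsert.setString(1, key);
            upsert.setString(2, merged);
            upsert.executeUpdate();
        }
    }

    // Placeholder merge logic; replace with your application-specific merge.
    private static String merge(String existing, String current) {
        return existing + current;
    }
}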

c. Yes, the number of reducers is configurable. If you want to set it per job, you can add a line in your job code's run() method to set the number of reducers:

jobConf.setNumReduceTasks(numReducers);   // equivalent to jobConf.setInt("mapred.reduce.tasks", numReducers)
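For context, a minimal sketch of where that call sits in an old-API driver; the class name, the reducer count of 400, and the paths are illustrative only, and the mapper/input-format setup is omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MergeJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(MergeJobDriver.class);
        jobConf.setJobName("merge-values");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        // jobConf.setMapperClass(...) and input format setup omitted for brevity
        jobConf.setReducerClass(ReduceJob.class);   // assumes the ReduceJob from (b) is nested in this class

        // Same effect as jobConf.setInt("mapred.reduce.tasks", 400):
        jobConf.setNumReduceTasks(400);             // illustrative value

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}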

If you want to set it on a per-machine basis, i.e. how many reducers each machine in the cluster should run, you need to change the cluster's Hadoop configuration:

mapred.tasktracker.{map|reduce}.tasks.maximum - The maximum number of map/reduce tasks, respectively, that are run simultaneously on a given TaskTracker. The default is 2 (2 maps and 2 reduces), but the right value depends on your hardware.
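For reference, on each TaskTracker node this looks roughly like the following entries in mapred-site.xml (the value 6 is only an illustration for an 8-core box, not a recommendation):

<!-- mapred-site.xml on every TaskTracker node -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <!-- illustrative: how many reduce tasks this node may run at once -->
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>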

More details here: http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons

d. If your data files are not gzipped (Hadoop cannot split gzipped files into InputSplits), then, by your numbers, you have up to about 200 * 1024 bytes * 1 billion records = 204,800 GB, i.e. roughly 204.8 TB of data. So if you want it finished in 2-3 hours, you had better keep all 50 boxes and, if the reducers' memory footprint is low, increase the number of reducers per machine as described in the previous answer. Also, increasing the InputSplit size to around 128 MB may help.
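As a rough sketch of that last point (again an illustration, not something from the original answer), the minimum split size can be raised on the job so each map task reads more data and fewer tasks get scheduled; in the old API that looks something like:

// Ask for ~128 MB input splits (assumes the old mapred.min.split.size property
// and a splittable input format, i.e. the files are not gzipped).
jobConf.setLong("mapred.min.split.size", 128L * 1024 * 1024);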

Thanks and regards,
Kartikeya Sinha

answered 2013-04-23T19:17:39