Questions tagged "hadoop-partitioning"

0 votes

4 answers

33264 views

hadoop - What is the use of the grouping comparator in Hadoop MapReduce?

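I want to know why a grouping comparator is used in secondary sort in MapReduce.

According to the secondary sort example in the Definitive Guide:

We want the sort order for keys to be by year (ascending) and then by temperature (descending):

By setting a partitioner to partition by the year part of the key, we can guarantee that records for the same year go to the same reducer. This still isn't enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it doesn't change the fact that the reducer groups by key within the partition.

Since we have already written our own partitioner, which takes care of sending map output keys to a particular reducer, why do we also need to group?

Thanks in advance.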
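The distinction the quoted passage draws can be illustrated with a minimal sketch in plain Java, with no Hadoop dependency. The partitioner only decides which reducer receives a key; the grouping comparator decides which consecutive keys in the sorted stream share a single reduce() call. The class and method names below (GroupingSketch, YearTemp, reduceCalls, samplePartition) are hypothetical and exist only for this illustration; the loop imitates how sorted keys are split into reduce() calls, it is not Hadoop's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GroupingSketch {
    /** Composite map output key (hypothetical stand-in for a Writable pair). */
    public static final class YearTemp {
        public final int year;
        public final int temp;
        public YearTemp(int year, int temp) { this.year = year; this.temp = temp; }
    }

    /** Sort order: year ascending, then temperature descending. */
    public static final Comparator<YearTemp> SORT = (a, b) ->
        a.year != b.year ? Integer.compare(a.year, b.year) : Integer.compare(b.temp, a.temp);

    /** Grouping order: only the year part of the key is compared. */
    public static final Comparator<YearTemp> GROUP_BY_YEAR =
        (a, b) -> Integer.compare(a.year, b.year);

    /**
     * Sorts the keys of one partition, then starts a new "reduce() call"
     * whenever the grouping comparator says the key differs from the previous one.
     */
    public static List<List<YearTemp>> reduceCalls(List<YearTemp> partition,
                                                   Comparator<YearTemp> grouping) {
        List<YearTemp> sorted = new ArrayList<>(partition);
        sorted.sort(SORT);
        List<List<YearTemp>> calls = new ArrayList<>();
        YearTemp previous = null;
        for (YearTemp key : sorted) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }

    /** Made-up records that the partitioner has already routed to one reducer. */
    public static List<YearTemp> samplePartition() {
        return Arrays.asList(
            new YearTemp(1900, 34), new YearTemp(1901, 20), new YearTemp(1900, 35),
            new YearTemp(1901, 36), new YearTemp(1900, 10));
    }

    public static void main(String[] args) {
        List<YearTemp> partition = samplePartition();
        // Grouping by the full key: every distinct (year, temp) is its own call.
        System.out.println("calls when grouping by full key: "
            + reduceCalls(partition, SORT).size());
        // Grouping by year: one call per year, and its first key holds the maximum.
        for (List<YearTemp> call : reduceCalls(partition, GROUP_BY_YEAR)) {
            YearTemp first = call.get(0);
            System.out.println(first.year + " -> max temp " + first.temp
                + " (" + call.size() + " records in one call)");
        }
    }
}
```

With the sample data, grouping by the full key yields five separate calls, while grouping by year yields two, and the first key of each call carries that year's maximum temperature. That is the part the partitioner alone cannot provide.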

2013-02-06T11:54:53.833

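0 votes

0 answers

233 views

hadoop - Hadoop MapReduce - a single reducer is heavily loaded

I am running a Pig script that looks like

The field bucketid has 200 distinct values, so I set PARALLEL to 200, expecting each reducer to handle one group. However, some reducers do nothing, while others handle several groups. What is the reasoning behind this?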
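One plausible explanation is hash partitioning: keys are assigned to reducers by hash value modulo the reducer count, so 200 distinct keys are not guaranteed to land on 200 distinct reducers. The sketch below uses the same formula as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but with String.hashCode as a stand-in. The key names ("bucket-0" and so on) and the class HashPartitionSketch are made up for illustration, and the hash Pig actually applies to the group key may differ.

```java
public class HashPartitionSketch {
    /** Same arithmetic as Hadoop's default HashPartitioner, applied to a String key. */
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many of the hypothetical keys "bucket-0".."bucket-(n-1)" hit each reducer. */
    public static int[] keysPerReducer(int distinctKeys, int numReducers) {
        int[] counts = new int[numReducers];
        for (int i = 0; i < distinctKeys; i++) {
            counts[partitionFor("bucket-" + i, numReducers)]++;
        }
        return counts;
    }

    /** Number of reducers that received no key at all. */
    public static int emptyReducers(int[] counts) {
        int empty = 0;
        for (int c : counts) {
            if (c == 0) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        int[] counts = keysPerReducer(200, 200);
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        System.out.println("reducers with no keys: " + emptyReducers(counts));
        System.out.println("most keys on one reducer: " + max);
    }
}
```

Whenever two keys collide modulo the reducer count, one reducer gets both groups and another gets none, which matches the behaviour described above.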

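The real problem I'm facing is that one reducer, R, lags behind the others, and its task log shows "merging 13GB of data" (while the reducer is in the reduce phase). Based on my input data, however, I don't expect R to process that much data. When it finishes, the output part file that R produces is only 350 MB (gzip-compressed), and if I decompress it, it comes to only 6 GB. So I wonder why the log says "merging 13 GB of data" while the reducer is running. Is there a reason behind this? Am I missing something?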

hadoop mapreduce hadoop-partitioning

2013-03-03T06:06:32.963

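0 votes

1 answer

904 views

java - Understanding a MapReduce algorithm for overlap calculation

I need help understanding an algorithm. I have pasted the explanation of the algorithm first, followed by my doubts.

Algorithm: (for computing the overlap between pairs of records)

Given a user-defined parameter K, the file DR (*Format:record_id,data*) is split into K nearly equal-sized chunks, such that the data of a document Di falls into the i/K-th chunk.

We override Hadoop's partitioning function, which maps a key emitted by the mapper to a reducer instance. Each key (i,j) is mapped to a reducer in the j/K-th group.

The special key i,* and its associated value, i.e. the document's data, is replicated at most K times, so that the full content of the document can be delivered at each reducer. Hence, each reducer in a group needs to recover and load in memory only one chunk of the DR file, whose size can be made arbitrarily small by varying K. The overlap can thus be computed. This comes at the cost of replicating the documents delivered through the MapReduce framework.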
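The phrase "the j/K-th group" can be read literally as integer division, and that reading is sketched below. This is only an assumption about what the description means, not the algorithm's actual partitioner; the class ChunkGroupSketch and the method groupOf are hypothetical names.

```java
public class ChunkGroupSketch {
    /**
     * Literal reading of "key (i, j) goes to the j/K-th group":
     * integer division of j by K. This interpretation is an assumption.
     */
    public static int groupOf(int j, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("K must be positive");
        }
        if (j < 0) {
            throw new IllegalArgumentException("j must be non-negative");
        }
        return j / k;
    }

    public static void main(String[] args) {
        int k = 3;
        // Consecutive values of j share a group in runs of length K.
        for (int j = 0; j < 9; j++) {
            System.out.println("j=" + j + " -> group " + groupOf(j, k));
        }
    }
}
```

Under this reading, with K = 3 the values j = 0, 1, 2 share group 0, j = 3, 4, 5 share group 1, and so on. The number of groups then depends on the range of j as well as on K, so the sketch alone does not settle whether K equals the number of reducers.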

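Doubts:

I have made some assumptions:

Statement: Each key (i,j) is mapped to a reducer in the j/K-th group. Assumption: there are K reduce nodes, and the key is mapped to the j/K-th reduce node.

Doubt: Are some reduce nodes grouped together? Say, are nodes 0, 1, 2 grouped as Group-0?

Statement: The document's data is replicated at most K times, so that the full content of the document can be delivered at each reducer.

So does this mean that K equals the number of reducer nodes? If not, we would be wasting compute nodes by not using them, right?

Main doubt: Is K equal to the number of reducer nodes?

Hoping for a response!

Thanks!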

java hadoop mapreduce elastic-map-reduce hadoop-partitioning

2013-03-10T06:05:39.770

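0 votes

2 answers

581 views

hadoop - How to use the Hadoop MapReduce framework for an OpenCL application?

I am developing an application in OpenCL whose basic goal is to implement a data mining algorithm on a GPU platform. I want to use the Hadoop Distributed File System and execute the application on multiple nodes. I am using the MapReduce framework, and I have divided my basic algorithm into two parts, "Map" and "Reduce".

I have never worked with Hadoop before, so I have some questions:

Do I have to write my application in Java to use Hadoop and the MapReduce framework?
I have written kernel functions for map and reduce in OpenCL. Is it possible to use the HDFS file system for a non-Java GPU computing application? (Note: I don't want to use JavaCL or Aparapi.)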

hadoop mapreduce opencl gpu hadoop-partitioning

2013-03-19T09:30:15.877

0 votes

1 answer

301 views

java - Hadoop reducers receiving wrong data

I have a load of JobControls running at the same time, all with the same set of ControlledJobs. Each JobControl deals with a different set of input and output files, split by date range, but they are all of the same type. The problem I am observing is that the reduce steps receive data meant to be processed by a reducer handling a different date range. The date range is set by the Job, used to determine the input and output, and read from the context within the reducer.

This stops if I submit the JobControls sequentially, but that's no good. Is this something I should be solving with a custom partitioner? How would I even determine the correct reducer for a key if I don't know which reducer is dealing with my current date range? Why would the instantiated reducers not be locked to their JobControl?

I have written all the JobControls, Jobs, Maps and Reduces against their base implementations in Java.

I'm using Hadoop 2.0.3-alpha with YARN. Could that have anything to do with it?

I have to be a little careful sharing the code, but here's a sanitised mapper:

And Reducer:

java hadoop mapreduce hadoop-partitioning

2013-03-19T15:53:49.937

0 votes

3 answers

196 views