java - How best to decide mapper output/reducer input for a huge string

Question

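I need to improve an MR job that uses HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing it out as one huge string, and having the reducer do some computation and dump the result into an HBase table.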

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The mapper output looks like this:

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

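This is for 1 row of Table1. Likewise, there are 19 million mapper outputs.

I'm interested in sorting the output by the HouseHoldID value, so I'm using this technique. I'm not interested in the V (value) part of the pair, so I'm more or less ignoring it. My mapper class is defined as follows: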

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

My MR job takes 22 hours to complete, which is not acceptable at all. I need to optimize it somehow so that it runs faster.

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
    Table1,                 // input HBase table name
    scan,
    AnalyzeMapper.class,    // mapper
    Text.class,             // mapper output key
    IntWritable.class,      // mapper output value
    job);

TableMapReduceUtil.initTableReducerJob(
    OutputTable,                // output table
    AnalyzeReducerTable.class,  // reducer class
    job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.
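For a sense of scale, the figures in this post can be turned into a rough throughput estimate. This is only a back-of-the-envelope sketch: it assumes the whole 22 hours is spent on the Table1 rows and that the 21 mappers run concurrently with an even load, neither of which is established above. The class and method names are made up for illustration.

```java
public class ThroughputEstimate {
    // Average rows processed per second over the whole job.
    static double rowsPerSecond(long rows, double hours) {
        return rows / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long table1Rows = 19_000_000L; // "~19 million rows" from the post
        double jobHours = 22.0;        // reported wall-clock time
        int mappers = 21;              // one mapper per region of Table1

        double overall = rowsPerSecond(table1Rows, jobHours);
        // Assumption: mappers are evenly loaded and all run at once.
        double perMapper = overall / mappers;

        System.out.printf("overall: %.1f rows/s%n", overall);
        System.out.printf("per mapper: %.1f rows/s%n", perMapper);
    }
}
```

Under those assumptions this works out to roughly 240 rows per second for the whole job, or about 11 rows per second per mapper.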

Am I doing something wrong here?

Should I use a custom SortComparator or Group Comparator or something like that to make it more efficient?
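To make the comparator question concrete, here is a minimal sketch in plain Java (no Hadoop classes) of the comparison such a comparator would perform: it looks only at the leading HouseHoldId field instead of the whole string. The class name, the helper, and the sample records are hypothetical placeholders. In a real job this logic would presumably live in a Hadoop comparator registered through `job.setSortComparatorClass(...)` or `job.setGroupingComparatorClass(...)`, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HouseholdKeySketch {
    // Returns the first space-delimited field, assumed to be HouseHoldId.
    static String householdId(String record) {
        int sp = record.indexOf(' ');
        return sp < 0 ? record : record.substring(0, sp);
    }

    // Orders records by HouseHoldId only; the rest of the string is ignored.
    static final Comparator<String> BY_HOUSEHOLD =
            Comparator.comparing(HouseholdKeySketch::householdId);

    public static void main(String[] args) {
        // Made-up sample records in the "HouseHoldId contentID ..." layout.
        List<String> records = new ArrayList<>(Arrays.asList(
                "H2 c9 news 30", "H1 c3 movie 120", "H2 c1 sport 90"));
        records.sort(BY_HOUSEHOLD);
        records.forEach(System.out::println);
    }
}
```

Two records with the same HouseHoldId compare as equal here, which is the behavior a grouping comparator would need in order to hand all of a household's records to a single reduce call.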
