hadoop - Querying data from a DBMS in a Hadoop Mapper before mapping

Question

I am fairly new to MapReduce in Hadoop. I am trying to process entries from many log files. The mapper is quite similar to the one in the WordCount tutorial.

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

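The thing is, instead of using the word as the key for the reducer, I want to use related data from a table in an RDBMS. For example, the processed text looks like this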

apple orange duck apple giraffe horse lion, lion grape

and there is a table

name     type
apple    fruit
duck     animal
giraffe  animal
grape    fruit
orange   fruit
lion     animal

So, instead of counting words, I want to count types. The output would look like

fruit 4
animal 5

Building on the previous code, it would look something like this

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String object = tokenizer.nextToken();
        //========================================
        String type = SomeClass.translate(object);
        //========================================
        word.set(type);
        output.collect(word, one);
    }
}

Here SomeClass.translate would translate the object name into its type by querying the RDBMS.

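My questions

Is this doable? (And how would I do it?)
What are the concerns? I understand that the mapper will run on more than one machine. So if the word apple shows up on many machines, how can I reduce the number of database queries for apple?
Or is there a good alternative to doing the translation in the mapper? Or is there perhaps a common way to do this? (Or is this whole question a really stupid one?)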

Update

I am implementing this with Apache Hadoop on Amazon Elastic MapReduce, and the translation table is stored in Amazon RDS/MySQL. I would really appreciate some sample code or links.

score 1 · Accepted Answer

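To summarize the requirement: join the data in the table with the data in the files, then count over the joined data. Depending on the size of the input data, the join can be done in different ways (map-only or full MapReduce). For more details on joins, see Data-Intensive Text Processing with MapReduce, Section 3.5.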

score 1 · Accepted Answer

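If you are worried about minimizing database queries, you could do this in two MR jobs: first do a standard word count, then use the output of that job as the input to a second job that does the translation and re-sums by type.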
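To make the two-job idea concrete, here is a rough plain-Java sketch of the logic only. It deliberately uses no Hadoop classes so it can run standalone; the class name TwoStageSketch and the translate function are hypothetical stand-ins (translate plays the role of SomeClass.translate from the question). The point it tries to show is that after the first pass each distinct word appears once, so the lookup runs once per distinct word rather than once per occurrence.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java simulation of the two-job approach (not Hadoop API code).
class TwoStageSketch {

    // "Job 1": standard word count over all input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> wordCounts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                wordCounts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return wordCounts;
    }

    // "Job 2": translate each distinct word once, then re-sum the counts by type.
    // translate is a hypothetical lookup (for example, a database query).
    static Map<String, Integer> countTypes(Map<String, Integer> wordCounts,
                                           Function<String, String> translate) {
        Map<String, Integer> typeCounts = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            String type = translate.apply(entry.getKey());
            if (type != null) { // words with no table entry are skipped
                typeCounts.merge(type, entry.getValue(), Integer::sum);
            }
        }
        return typeCounts;
    }

    public static void main(String[] args) {
        // Small in-memory stand-in for the RDBMS table.
        Map<String, String> table = new HashMap<>();
        table.put("apple", "fruit");
        table.put("duck", "animal");
        table.put("grape", "fruit");

        Map<String, Integer> wordCounts =
                countWords(Arrays.asList("apple duck apple", "grape apple"));
        System.out.println(countTypes(wordCounts, table::get)); // prints {animal=1, fruit=4}
    }
}
```

In a real cluster the second job's input would be the first job's output files, and the per-type sums would be done by its reducer; the sketch folds both into a single loop.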

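Alternatively, if your mapping table is small enough to fit in memory, you could serialize it first, add it to the DistributedCache, and then load it into memory as part of the Mapper's setup method. Then there is no need to worry about doing the translation too many times, since it is just a cheap in-memory lookup.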
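As an illustration of the in-memory variant, here is a minimal plain-Java sketch. It is not Hadoop API code: the class LookupMapperSketch and its setup and map methods are hypothetical stand-ins for the real Mapper hooks, and the tab-separated "name, type" row format is an assumption about how the table was exported. In an actual job, the rows would come from the file shipped through the DistributedCache, and map would emit (type, 1) pairs instead of updating a local map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation of a mapper that loads its lookup table once.
class LookupMapperSketch {

    // name -> type, filled once per mapper task.
    private final Map<String, String> typeByName = new HashMap<>();

    // Stand-in for the one-time setup step: parse "name<TAB>type" rows
    // (assumed export format of the table).
    void setup(List<String> tableRows) {
        for (String row : tableRows) {
            String[] cols = row.split("\t");
            if (cols.length == 2) {
                typeByName.put(cols[0], cols[1]);
            }
        }
    }

    // Stand-in for map(): one in-memory lookup per token, no database round trip.
    // A real mapper would call output.collect(type, one) here.
    void map(String line, Map<String, Integer> counts) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String type = typeByName.get(tokenizer.nextToken());
            if (type != null) { // names missing from the table are skipped
                counts.merge(type, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        LookupMapperSketch mapper = new LookupMapperSketch();
        mapper.setup(Arrays.asList("apple\tfruit", "duck\tanimal", "grape\tfruit"));

        Map<String, Integer> counts = new TreeMap<>();
        mapper.map("apple duck apple grape zebra", counts);
        System.out.println(counts); // prints {animal=1, fruit=3}
    }
}
```

Each mapper task pays the load cost once, so the database (or the exported file) is read once per task rather than once per word.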
