java - Joining two datasets with Hadoop that need two maps and one reduce

Question

Possible duplicate:
Equivalent of mongo's out:reduce option in hadoop

I have 2 datasets, one of which supplements the other. They look like this (not the actual fields):

Question
========
id(key)
name
description

Answer
========
id(key)
type
question_id

Output
======
question_id (key)
name
description
type_a_count
type_b_count

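I want to know how many answers of each type every question has. I used to do this with mongodb's map reduce engine: the answer mapper emitted the same fields as my question mapper (but zeroed out), except for one of the type_count fields, and the reducer added everything up. The problem I have now is that when I run the answer mapper, the values from my question mapper are overwritten by the values from the answer mapper.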

I am looking for the equivalent of mongodb's {out: "reduce"} option.

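More details:

For the questions I use only a mapper
The outputURI is the same for both jobs because I want the results to merge
I would like to use the output of the question mapper and the output of the answer mapper as the input for my reducer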
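
The merge described above (both mappers emit the same record shape keyed by question_id, and a single reducer combines them) can be sketched in plain Java. This is only a simulation of the pattern under stated assumptions, not a Hadoop job: no Hadoop classes are used, a `Map.merge` call stands in for the shuffle and reduce phases, and every name in it (`ReduceSideJoinSketch`, `Row`, `mapQuestion`, `mapAnswer`, `reduce`, `run`) is hypothetical. The sample rows are the ones used in the accepted answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join. No Hadoop classes are used;
// all class and method names here are hypothetical.
class ReduceSideJoinSketch {

    // The common record shape that both mappers emit, keyed by question_id.
    static final class Row {
        final String questionId;
        final String name;        // empty when emitted by the answer mapper
        final String description; // empty when emitted by the answer mapper
        final long typeACount;
        final long typeBCount;

        Row(String questionId, String name, String description,
            long typeACount, long typeBCount) {
            this.questionId = questionId;
            this.name = name;
            this.description = description;
            this.typeACount = typeACount;
            this.typeBCount = typeBCount;
        }

        @Override
        public String toString() {
            return questionId + "\t" + name + "\t" + description
                + "\t" + typeACount + "\t" + typeBCount;
        }
    }

    // Question mapper: input is {id, name, description}; both counts are zero.
    static Row mapQuestion(String[] q) {
        return new Row(q[0], q[1], q[2], 0, 0);
    }

    // Answer mapper: input is {id, type, question_id}; text fields are blank
    // and exactly one count is 1 (assuming the type is "a" or "b").
    static Row mapAnswer(String[] a) {
        return new Row(a[2], "", "",
            "a".equals(a[1]) ? 1 : 0,
            "b".equals(a[1]) ? 1 : 0);
    }

    // Reducer step: merge two rows that share a key, keeping whichever text
    // fields are non-empty and summing the counts.
    static Row reduce(Row x, Row y) {
        return new Row(
            x.questionId,
            x.name.isEmpty() ? y.name : x.name,
            x.description.isEmpty() ? y.description : x.description,
            x.typeACount + y.typeACount,
            x.typeBCount + y.typeBCount);
    }

    // Stand-in for shuffle + reduce: the output of both mappers goes into
    // one merge keyed by question_id.
    static Map<String, Row> run(List<String[]> questions, List<String[]> answers) {
        List<Row> emitted = new ArrayList<>();
        for (String[] q : questions) emitted.add(mapQuestion(q));
        for (String[] a : answers) emitted.add(mapAnswer(a));

        Map<String, Row> out = new TreeMap<>();
        for (Row r : emitted) {
            out.merge(r.questionId, r, ReduceSideJoinSketch::reduce);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> questions = Arrays.asList(
            new String[] {"1", "what?", "desc what"},
            new String[] {"2", "where?", "Desc where"});
        List<String[]> answers = Arrays.asList(
            new String[] {"1", "a", "1"},
            new String[] {"2", "a", "1"},
            new String[] {"3", "a", "1"},
            new String[] {"4", "b", "2"});
        run(questions, answers).values().forEach(System.out::println);
    }
}
```

Because the reducer step only sums counts and keeps non-empty text, the order in which question and answer rows arrive does not matter, so nothing is overwritten. In a real Hadoop job, `MultipleInputs.addInputPath` (in `org.apache.hadoop.mapreduce.lib.input`) lets each input path have its own mapper class while both feed one reducer; that wiring is not shown here, and the shared record would need to be a `Writable` type.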

score 1 · Accepted Answer

This answer may or may not be to your liking. I know you tagged this java, but there is a library called cascalog (written in clojure) for writing hadoop queries. It is simple:

$ lein repl
REPL started; server listening on localhost port 16309
myapp=> (use 'cascalog.playground)
nil
myapp=> (bootstrap)
nil
myapp=> (def questions [["1" "what?" "desc what"] ["2" "where?" "Desc where"]])
#'myapp/questions
myapp=> (def answers [["1" "a" "1"]["2" "a" "1"]["3" "a" "1"]["4" "b" "2"]])
#'myapp/answers
myapp=> (?<- (stdout) [?type ?name ?desc ?count] (questions ?qid ?name ?desc) (answers ?aid ?type ?qid) (c/count ?count))

RESULTS
-----------------------
a       what?   desc what   3
b       where?  Desc where  1

Here is a good starting point for learning cascalog: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html.
