hadoop - How do I generate a custom schema from a relation in Pig?

Question

I have a relation that holds the tf-idf values of words across various articles. Its schema is:

tfidf_relation: {word: chararray,id: bytearray,tfidf: double}

Here is a sample of the data:

(cat,article_one,0.13515503603605478)
(cat,article_two,0.4054651081081644)
(dog,article_one,0.3662040962227032)
(apple,article_three,0.3662040962227032)
(orange,article_three,0.3662040962227032)
(parrot,article_one,0.13515503603605478)
(parrot,article_three,0.13515503603605478)

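I would like to get output of the form: cat article_one 0.13515503603605478, article_two 0.4054651081081644, and so on. The question is, how do I build from this a relation that contains the word field plus a tuple of the id and tfidf fields? Something like this:

    X = FOREACH tfidf_relation GENERATE word, (id, tfidf);

does not work. What is the correct syntax?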

score 1 · Accepted Answer

Try this:

    t = LOAD 'input/file' USING PigStorage(',') as (word: chararray,id: bytearray,tfidf: double);
    u = group t by word;
    dump u;

The output will be

    (cat,{(cat,article_two,0.4054651081081644),(cat,article_one,0.13515503603605478)})
    (dog,{(dog,article_one,0.3662040962227032)})
    (apple,{(apple,article_three,0.3662040962227032)})
    (orange,{(orange,article_three,0.3662040962227032)})
    (parrot,{(parrot,article_three,0.13515503603605478),(parrot,article_one,0.13515503603605478)})

I hope this is what you are looking for.
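
The grouped bags above still repeat the word inside every tuple. A minimal sketch of one way to drop it, assuming the same input file and schema as in this answer, is to project only id and tfidf out of each bag after grouping. The alias v is a name chosen here for illustration:

    t = LOAD 'input/file' USING PigStorage(',') AS (word: chararray, id: bytearray, tfidf: double);
    u = GROUP t BY word;
    -- keep the group key as word and project only (id, tfidf) from each grouped bag
    v = FOREACH u GENERATE group AS word, t.(id, tfidf);
    DUMP v;

Each row should then pair a word with a bag of (id, tfidf) tuples, which is close to the shape asked for in the question. The order of tuples inside a bag is not guaranteed.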

score 0 · Accepted Answer


    X = FOREACH tfidf_relation GENERATE word, {(id, tfidf)};

This is probably what you need.

Answered 2011-04-18T20:59:34.680
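
If the literal tuple or bag syntax does not parse on your Pig version, a hedged alternative is to use the built-in TOTUPLE and TOBAG functions, which I believe are available from Pig 0.8 onward. The aliases X and Y are illustrative, and neither statement groups rows by word:

    -- one (id, tfidf) tuple per input row
    X = FOREACH tfidf_relation GENERATE word, TOTUPLE(id, tfidf);
    -- the same tuple wrapped in a single-element bag
    Y = FOREACH tfidf_relation GENERATE word, TOBAG(TOTUPLE(id, tfidf));

To collect all articles for a word into one bag, a GROUP BY word step is still needed.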
