I have two data sets.

main_data.txt

{"id":"foo", "some_field":12354, "score":0}
{"id":"foobar", "some_field":12354, "score":0}

score_data.txt

{"id":"foo", "score":1}
{"id":"foobar","score":20}

……

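So in main_data the score is initialized to 0. Also, main_data and score_data share some ids.

For the ids they have in common, I want to replace the "score" in main_data with the score from score_data.

If an id is not present in score_data, I want its score to stay 0.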

1 Answer

Why initialize "score" to 0 at all? You could simply leave it out and join main_data (LEFT OUTER) with score_data. Whether or not you drop it, this should work:

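-- LOAD/STORE below are placeholders; substitute your own paths and storage functions
main_data = LOAD USING SOME STORAGE; -- assume we have id as a column
score_data = LOAD USING SOME STORAGE; -- assume we have id, score as columns
-- join on the plain field name; the relation::field prefix only applies to the joined output
joined_data = JOIN main_data BY id LEFT OUTER, score_data BY id;
-- ids with no match in score_data come back with a null score, so default them to 0
results = FOREACH joined_data GENERATE main_data::id AS id, (score_data::score IS NULL ? 0 : score_data::score) AS score;
STORE results USING SOMETHING SOMEWHERE;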
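
To sanity-check the expected output outside Pig, the same left-outer-join-with-default logic can be sketched in plain Python. This is only an illustration, not part of the Pig script. The function name `merge_scores` and the extra id "baz" are hypothetical, and the sketch assumes each id appears at most once in score_data.

```python
import json


def merge_scores(main_lines, score_lines):
    """Left outer join of main records with score records on "id"."""
    # Build a lookup table id -> score from the score data set.
    scores = {}
    for line in score_lines:
        rec = json.loads(line)
        scores[rec["id"]] = rec["score"]

    merged = []
    for line in main_lines:
        rec = json.loads(line)
        # Ids missing from the score data fall back to 0,
        # mirroring the IS NULL check in the Pig script.
        rec["score"] = scores.get(rec["id"], 0)
        merged.append(rec)
    return merged


main_lines = [
    '{"id":"foo", "some_field":12354, "score":0}',
    '{"id":"foobar", "some_field":12354, "score":0}',
    '{"id":"baz", "some_field":12354, "score":0}',  # hypothetical id with no score entry
]
score_lines = [
    '{"id":"foo", "score":1}',
    '{"id":"foobar","score":20}',
]

for rec in merge_scores(main_lines, score_lines):
    print(rec["id"], rec["score"])
```

With this sample input, "foo" and "foobar" pick up their scores from score_data, while "baz" keeps 0 because it has no matching row.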
answered 2013-10-31T16:51:34.780