Apache PIG 非常适合此类任务。参见示例:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate group as id, sum/count as mean, sum as sum, count as count;
};
请特别注意 amnt 列的数据类型,因为它将影响 SUM 函数 PIG 将调用的实现。
PIG 还可以做一些 SQL 做不到的事情,它可以在不使用任何内部连接的情况下将平均值放在每个输入行上。如果您使用标准差计算 z 分数,这将非常有用。
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
FLATTEN(inpt) 可以解决问题,现在您可以访问对组平均值、总和和计数做出贡献的原始金额。
更新 1:
计算方差和标准差:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
dif = (amnt - avg) * (amnt - avg) ;
generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
它将使用 2 个工作。还没想好怎么做,嗯,需要多花点时间在上面。