I have a huge tabular text file.

The data is stored in a directory: data/data1.txt, data2.txt, and so on.

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on...

What I want to do is find the average amount for each merchant.

So in the end I want to save the output to a file, something like:

merchant_id, average_amount
1234, avg_amt_1234
and so on.

How do I also calculate the standard deviation?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

4 Answers

Apache Pig is well suited to this kind of task. See the example:

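-- Load the input; the schema declares amnt as a double.
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- Collect all rows that share the same id into one bag.
grp = group inpt by id;
-- Compute the sum, count and mean of amnt for each group.
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};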

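Pay particular attention to the data type of the amnt column, because it determines which implementation of the SUM function Pig will invoke.

Pig can also do something that SQL without window functions cannot: it can put the mean on every input row without using an inner join. This is very useful if you are computing z-scores from the standard deviation.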

mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

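FLATTEN(inpt) does the trick: each output row now carries the original amount alongside the mean, sum and count of the group it contributed to.

Update 1:

Computing the variance and standard deviation: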

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
-- First pass: attach the group mean and count to every input row.
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- Squared deviation of each amount from its group mean.
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- Second pass: sum the squared deviations per group.
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- Population variance and standard deviation (divides by count, not count - 1).
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

This uses 2 jobs. I have not yet figured out how to do it in one; I need to spend more time on it.
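
To check the arithmetic of the two-pass approach outside Pig, here is a minimal Python sketch. It is not part of the Pig script: the function name two_pass_std and the sample rows (taken from the question's merchant_id and amount columns) are assumptions for illustration only. Like the script above, it divides by count, so it gives the population standard deviation.

```python
import math
from collections import defaultdict


def two_pass_std(rows):
    """rows: iterable of (merchant_id, amount) pairs.

    Returns {merchant_id: (mean, variance, std)} using population formulas.
    """
    # Equivalent of "group inpt by id": collect the amounts of each merchant.
    groups = defaultdict(list)
    for merchant_id, amount in rows:
        groups[merchant_id].append(amount)

    stats = {}
    for merchant_id, amounts in groups.items():
        count = len(amounts)
        mean = sum(amounts) / count                      # first pass
        sqr_sum = sum((a - mean) ** 2 for a in amounts)  # second pass
        variance = sqr_sum / count
        stats[merchant_id] = (mean, variance, math.sqrt(variance))
    return stats


# Hypothetical sample rows based on the question's example data.
rows = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]
stats = two_pass_std(rows)
for merchant_id, (mean, variance, std) in sorted(stats.items()):
    print(merchant_id, mean, variance, std)
```

The second pass needs the mean from the first, which is why the Pig version groups twice.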

answered 2012-09-27T10:08:35.300

So what exactly do you want: running Java code, or the abstract map-reduce process? For the second:

Map step:

record -> (merchant_id as key, amount as value)

Reduce step:

(merchant_id, amount) -> (merchant_id, aggregate the value you want)

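Because the reduce step receives the stream of all records that share the same key, you can compute almost anything you like there, including the mean and the variance.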
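
The two steps above can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop code: map_step, reduce_step and run are hypothetical names, the header line is assumed to have been skipped already, and the sort plus groupby stands in for the shuffle that a real framework performs between the two steps.

```python
from itertools import groupby
from operator import itemgetter


def map_step(line):
    """record -> (merchant_id as key, amount as value)"""
    merchant_id, _user_id, amount = (field.strip() for field in line.split(","))
    return merchant_id, float(amount)


def reduce_step(merchant_id, amounts):
    """(merchant_id, amounts) -> (merchant_id, average amount)"""
    amounts = list(amounts)
    return merchant_id, sum(amounts) / len(amounts)


def run(lines):
    # Sorting by key imitates the shuffle phase: equal keys become adjacent.
    pairs = sorted(map(map_step, lines), key=itemgetter(0))
    return [
        reduce_step(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Data rows from the question, without the header line.
lines = ["1234, 9123, 299.2", "1233, 9199, 203.2", " 1234, 0124, 230"]
result = run(lines)
for merchant_id, average in result:
    print(merchant_id, average)
```

Swapping the body of reduce_step is enough to produce a different aggregate for each merchant.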

answered 2012-09-26T02:27:52.300

You can compute the standard deviation in a single step by using the formula

var = E(x^2) - (E(x))^2

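inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    -- Pig Latin has no ** operator, so square each amount with a nested foreach.
    sq = foreach inpt generate amnt * amnt as amnt2;
    sum = SUM(inpt.amnt);
    sum2 = SUM(sq.amnt2);
    count = COUNT(inpt);
    -- variance = E(x^2) - (E(x))^2; the standard deviation is its square root.
    generate flatten(inpt), sum/count as avg, count as count, SQRT(sum2/count - (sum/count) * (sum/count)) as std;
};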

That's it!
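
As a sanity check of the one-pass formula, here is a small Python sketch that keeps only three running totals: the count, the sum and the sum of squares. The name one_pass_std and the sample amounts are assumptions for illustration. Because the formula subtracts two nearly equal numbers, rounding can leave a tiny negative variance, so the sketch clamps it at zero before taking the square root.

```python
import math


def one_pass_std(amounts):
    """Return (mean, population std) from a single pass over the amounts."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for x in amounts:
        count += 1
        total += x
        total_sq += x * x

    mean = total / count
    # var = E(x^2) - (E(x))^2
    variance = total_sq / count - mean * mean
    # Guard against a slightly negative result caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))


# Hypothetical amounts for one merchant, taken from the question's sample rows.
mean, std = one_pass_std([299.2, 230.0])
print(mean, std)
```

Since the three totals can be added across partitions, this form suits a single grouped pass.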

answered 2015-10-01T07:15:24.463

I computed all the statistics (minimum, maximum, mean and standard deviation) in a single loop. FILTER_DATA contains the dataset.

GROUP_SYMBOL_YEAR = GROUP FILTER_DATA BY (SYMBOL, SUBSTRING(TIMESTAMP,0,4));
STATS_ALL = FOREACH GROUP_SYMBOL_YEAR {
    MINIMUM = MIN(FILTER_DATA.CLOSE);
    MAXIMUM = MAX(FILTER_DATA.CLOSE);
    MEAN = AVG(FILTER_DATA.CLOSE);
    CNT = COUNT(FILTER_DATA.CLOSE);
    CSQ = FOREACH FILTER_DATA GENERATE CLOSE * CLOSE AS (CC:DOUBLE);
    GENERATE group.$0 AS (SYMBOL:CHARARRAY),
             MINIMUM AS (MIN:DOUBLE),
             MAXIMUM AS (MAX:DOUBLE),
             ROUND_TO(MEAN,6) AS (MEAN:DOUBLE),
             ROUND_TO(SQRT(SUM(CSQ.CC) / (CNT * 1.0) - (MEAN * MEAN)),6) AS (STDDEV:DOUBLE),
             group.$1 AS (YEAR:INT);
};
answered 2020-11-10T09:33:58.303