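To count every record in the input, use GROUP ALL, which puts all records into a single bag. For performance, this version also uses the accumulator-based DISTINCT function org.apache.pig.builtin.Distinct: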
X = load 'path' as (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray);
IPs = FOREACH X GENERATE ip; -- project early for performance reasons
GRP = group IPs all;
OUT = foreach GRP generate COUNT(IPs) as all_cnt, COUNT(org.apache.pig.builtin.Distinct(IPs.ip)) as distinct_cnt;
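For comparison, the same two counts can be written with Pig's nested FOREACH and the DISTINCT operator instead of calling the Distinct class by name. This is a sketch and not part of the original answer; it uses the same 'path' placeholder and schema as above, and the alias uniq is introduced here only for illustration.

```pig
X = load 'path' as (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray);
IPs = foreach X generate ip;   -- project early for performance reasons
GRP = group IPs all;           -- a single group holding every record
OUT = foreach GRP {
    uniq = distinct IPs.ip;    -- de-duplicate inside the single bag
    generate COUNT(IPs) as all_cnt, COUNT(uniq) as distinct_cnt;
};
```

Note that COUNT skips tuples whose first field is null; if rows with a null ip should be included in the total, COUNT_STAR(IPs) can be used there instead.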
If you have too many IPs and run into memory-related exceptions, you can do the following instead:
X = load 'path' as (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray);
IPs = FOREACH X GENERATE ip; -- project early for performance reasons
Dist_IPs = distinct IPs;
GRP_DIST = group Dist_IPs all;
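DIST = foreach GRP_DIST generate COUNT(Dist_IPs) as cnt, 'dist' as category; -- the bag is named after the grouped relation
GRP_ALL = group IPs all;
TOTAL = foreach GRP_ALL generate COUNT(IPs) as cnt, 'all' as category; -- alias is TOTAL because ALL is a reserved keyword
OUT = union DIST, TOTAL;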