
I'll start by saying that I'm a sysadmin by trade and new to Pig, so please be gentle.

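I'm trying to use Pig to parse Apache web logs from our CDN. For one application, we can glean three different call types from the URI, along with three different app/version strings (caused by inconsistencies in app development). I need to collect these and produce a report detailing the number of calls of each type per app/version.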

The call type in the URI will contain one of the following: valid, wms, tile. The app name in the userAgent field may look like any of the following:

APP%20NAME/0.0 CFNetwork/609.1.4 Darwin/13.0.0"

Android APP NAME 0.0.0 (SCH-I605 - Android 4.1.2, SDK XX)

APP NAME 0.0.0 (iPhone OS 6.1.3 - iPhone, XXX.XX.XXX.XX.XXXX, XXXXXXXX 0.0)"

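This is what I had put together before I discovered the inconsistent userAgent naming. It is probably a hack at best, but it was producing what was needed.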

Any help is appreciated.

register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
logs = LOAD '$INPUT' USING LogLoader as (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer,userAgent);
FILTERED = FILTER logs by userAgent matches '.*MapKit.*' OR userAgent matches '.*Darwin.*' or userAgent matches '.*Android.*';
DARWINONLY = FOREACH FILTERED GENERATE DayExtractor(time) as day, uri, bytes, userAgent;
FILTERVALID = FILTER DARWINONLY BY uri matches '.*valid.*';
FILTERTILE = FILTER DARWINONLY BY uri matches '.*tile.*';
FILTERWMS = FILTER DARWINONLY BY uri matches '.*wms.*';
VALIDAPPTIME = FOREACH FILTERVALID GENERATE day as validframeday, EXTRACT(userAgent, '([^\\s]+)') as validframeapp,bytes as validbytes;
WMSAPPTIME = FOREACH FILTERWMS GENERATE day as wmsday, EXTRACT(userAgent, '([^\\s]+)') as wmsapp,  bytes as wmsbytes;
TILEAPPTIME = FOREACH FILTERTILE GENERATE day as tileday, EXTRACT(userAgent, '([^\\s]+)') as tileapp, bytes as tilebytes;
GROUPWMS = GROUP WMSAPPTIME BY ($0,$1);
GROUPTILE = GROUP TILEAPPTIME BY ($0,$1);
GROUPVALID = GROUP VALIDAPPTIME BY ($0,$1);
WMSAPPCOUNT = FOREACH GROUPWMS GENERATE FLATTEN(group), COUNT($1) as wmsnum, SUM(WMSAPPTIME.wmsbytes) as wmstotalbytes;
VALIDAPPCOUNT = FOREACH GROUPVALID GENERATE FLATTEN(group), COUNT($1) as validnum, SUM(VALIDAPPTIME.validbytes) as validtotalbytes;
TILEAPPCOUNT = FOREACH GROUPTILE GENERATE FLATTEN(group), COUNT($1) as tilenum, SUM(TILEAPPTIME.tilebytes) as tiletotalbytes:int;
Y = COGROUP VALIDAPPCOUNT BY (validframeday,validframeapp), WMSAPPCOUNT BY (wmsday,wmsapp), TILEAPPCOUNT BY (tileday,tileapp);
Z = FOREACH Y GENERATE group as dailyapp, VALIDAPPCOUNT.validnum, VALIDAPPCOUNT.validtotalbytes, WMSAPPCOUNT.wmsnum, WMSAPPCOUNT.wmstotalbytes, TILEAPPCOUNT.tilenum, TILEAPPCOUNT.tiletotalbytes;
STORE Z into '$OUTPUT';
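
One possible direction for the inconsistent userAgent naming, as an untested sketch rather than a known-good fix: pull the app name and version out of each of the three sample shapes with Pig's built-in REGEX_EXTRACT, then keep whichever pattern matched. The relation and field names APPFIELDS, NORMALIZED, cfapp, cfver, andapp, andver, iosapp, iosver, app and version are hypothetical, and the patterns are fitted only to the three sample strings listed above, so real logs may need adjustments.

```pig
-- Sketch only: patterns assume the three sample userAgent shapes in the question.
-- CFNetwork shape: APP%20NAME/0.0 CFNetwork/...
-- Android shape:   Android APP NAME 0.0.0 (...)
-- iPhone shape:    APP NAME 0.0.0 (iPhone OS ...)
APPFIELDS = FOREACH DARWINONLY GENERATE day, uri, bytes,
    REGEX_EXTRACT((chararray)userAgent, '^([^/\\s]+)/([0-9.]+) CFNetwork.*', 1) AS cfapp,
    REGEX_EXTRACT((chararray)userAgent, '^([^/\\s]+)/([0-9.]+) CFNetwork.*', 2) AS cfver,
    REGEX_EXTRACT((chararray)userAgent, '^Android (.+?) ([0-9.]+) \\(.*', 1) AS andapp,
    REGEX_EXTRACT((chararray)userAgent, '^Android (.+?) ([0-9.]+) \\(.*', 2) AS andver,
    REGEX_EXTRACT((chararray)userAgent, '^(.+?) ([0-9.]+) \\(iPhone.*', 1) AS iosapp,
    REGEX_EXTRACT((chararray)userAgent, '^(.+?) ([0-9.]+) \\(iPhone.*', 2) AS iosver;

-- REGEX_EXTRACT yields null when a pattern does not match, so take the first non-null.
-- REPLACE turns the URL-encoded space in the CFNetwork shape back into a real space.
NORMALIZED = FOREACH APPFIELDS GENERATE day, uri, bytes,
    (cfapp IS NOT NULL ? REPLACE(cfapp, '%20', ' ')
        : (andapp IS NOT NULL ? andapp : iosapp)) AS app,
    (cfver IS NOT NULL ? cfver
        : (andver IS NOT NULL ? andver : iosver)) AS version;
```

If this works as intended, the valid/wms/tile filters and the grouping could run on NORMALIZED with (day, app, version) as the key instead of the first token of userAgent. The CFNetwork sample carries a two-part version (0.0) while the other two carry three parts (0.0.0), so the versions may still need reconciling before the counts line up.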