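I'm running a Pig script, and everything runs very quickly until I reach the FOREACH ... GENERATE FLATTEN(...) line.
Is there any reason why this line would run so slowly? (It causes the whole script to time out on a fairly powerful cluster.)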
extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}
-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}
-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!
-- Remove duplicates
result = DISTINCT flattened;
Thanks, Barry