Let's take wordCount as an example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the whole bag every time the script is executed?
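To make the question concrete, this is roughly what I have in mind. It is only a sketch, assuming that STORE and LOAD with PigStorage are a suitable way to persist a relation between runs; the output path '/tmp/bag_words' is a hypothetical location I made up for illustration.

```pig
-- First run: build the relation once and write it out.
-- '/tmp/bag_words' is a hypothetical output path (assumption).
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
STORE bag_words INTO '/tmp/bag_words' USING PigStorage();

-- Later runs: skip the tokenizing step and read the stored words back.
bag_words = LOAD '/tmp/bag_words' USING PigStorage() AS (word:chararray);
```

If this works, the later script would contain only the final LOAD statement, so the tokenizing pass over the full input would not be repeated.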
Thanks.