hadoop - Apache Pig - Is it possible to serialize a variable?

Question

Let's take wordCount as an example:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the entire bag every time the script is executed?

Thanks.

score 2 · Accepted Answer

STORE bag_words INTO 'some-output-directory';

Then load it back later to skip the FOREACH ... GENERATE, FLATTEN, and TOKENIZE steps.
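A minimal sketch of the later script, assuming the data was stored with the default storage function (PigStorage, tab-delimited text) under the same 'some-output-directory' path. The schema has to be declared again on load, and the aliases word_groups and word_counts are illustrative names that do not appear in the question:

```pig
-- Reload the words that the earlier script stored, instead of
-- re-reading and re-tokenizing the raw input.
-- Assumption: stored with the default PigStorage, one word per line.
bag_words = LOAD 'some-output-directory' AS (word:chararray);

-- Continue the word count from the reloaded alias.
-- word_groups and word_counts are hypothetical alias names.
word_groups = GROUP bag_words BY word;
word_counts = FOREACH word_groups GENERATE group AS word, COUNT(bag_words) AS occurrences;

DUMP word_counts;
```

Note that STORE fails if the output directory already exists, so the storing script is run once and only the loading script is run repeatedly.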

score 0 · Accepted Answer

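You can use the STORE command to output any alias in Pig. You can use a standard format (such as CSV) or write your own PigLoader class to implement any specific behavior. You can then load this output in a separate script, bypassing the initial load.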
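As a rough illustration of the standard-format option, the built-in PigStorage function accepts a field delimiter, so a comma gives CSV-style output. The directory name 'bag-words-csv' is a hypothetical path chosen for this sketch:

```pig
-- Script 1: build the alias once and store it as comma-delimited text.
-- 'bag-words-csv' is a hypothetical output path.
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
STORE bag_words INTO 'bag-words-csv' USING PigStorage(',');

-- Script 2 (run separately): load the stored output with the same
-- delimiter and redeclare the schema.
bag_words = LOAD 'bag-words-csv' USING PigStorage(',') AS (word:chararray);
```

The load statement must use the same delimiter as the store statement, otherwise the fields will not be split the way they were written.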
