If I'm running Pig on a bunch of *.tar.gz files, PigStorage handles the decompression fine, but the tar header lines between the files in the archive aren't handled. Is there a simple way to deal with this, or do I have to write my own RecordReader? If so, what would it look like?

1 Answer

You can use tar to strip the headers on the fly. In your Pig script, do the following:

--Call to tar that reads from stdin and outputs to stdout
DEFINE CLEANTAR `tar xvf - -O`;

--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
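
To see what the streamed command does outside of Pig, here is a minimal local sketch. The file names and CSV contents are made up for illustration, and the `v` flag is omitted because it only adds a listing of member names. It assumes a tar that accepts `-O` (GNU tar and bsdtar both do).

```shell
# Local check of what `tar xf - -O` does to a gzipped tar stream.
# File names and contents below are hypothetical sample data.
workdir=$(mktemp -d)
printf 'id,name\n1,alice\n' > "$workdir/part1.csv"
printf 'id,name\n2,bob\n'   > "$workdir/part2.csv"

# Pack both files into one .tar.gz, as in the question.
tar -C "$workdir" -czf "$workdir/sample.tar.gz" part1.csv part2.csv

# Decompress, then let tar read the archive from stdin (-f -) and write
# the member contents to stdout (-O). Tar headers and padding are dropped.
cleaned=$(gzip -dc "$workdir/sample.tar.gz" | tar -xf - -O)

printf '%s\n' "$cleaned"
rm -rf "$workdir"
```

Only the tar metadata is removed this way. If each file inside the archive carries its own column header row, as in the sample data above, those rows are still present in the output and would need a separate FILTER step in Pig.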

Edit: added the alternative below.

You can also use sed to remove the tar headers:

--Remove tar headers using sed
DEFINE CLEANTAR `sed 's/[^\n]*\o000//g'`;

--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
answered 2013-06-11T14:08:43.863