If I'm running Pig on a bunch of *.tar.gz files, PigStorage handles the gzip decompression fine, but the tar header blocks between the archived files aren't handled and end up in the data. Is there a simple way to deal with this, or do I have to write my own RecordReader? If so, what would that look like?
You can use tar to strip the headers on the fly. In your Pig script, do the following:
--Call to tar that reads from stdin and outputs to stdout
DEFINE CLEANTAR `tar xvf - -O`;
--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
Edit: added the alternative below.
You can also use sed to remove the tar headers:
--Remove tar headers using sed
DEFINE CLEANTAR `sed 's/[^\n]*\o000//g'`;
--Now, remove tar headers from your data
cleaned = STREAM mydata THROUGH CLEANTAR;
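As for what a hand-written alternative might look like: instead of a Java RecordReader, one option is a small script that does the same job as the tar command. The following is only a minimal sketch and not part of the original answer. It uses Python's standard tarfile module in stream mode; the function name strip_tar_headers is hypothetical, and it assumes the bytes reaching the script are the already-decompressed tar stream, unmodified.

```python
import tarfile


def strip_tar_headers(raw, out):
    """Copy only the contents of regular files in a tar stream from raw to out.

    raw and out are binary file-like objects (hypothetical helper, for illustration).
    """
    # Mode "r|" reads the archive as a forward-only stream, which suits stdin.
    with tarfile.open(fileobj=raw, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            data = archive.extractfile(member)
            # Copy the member body in chunks so large files are not held in memory.
            while True:
                chunk = data.read(65536)
                if not chunk:
                    break
                out.write(chunk)
```

To use it as a streaming command, the script would call strip_tar_headers(sys.stdin.buffer, sys.stdout.buffer) and be named in the DEFINE statement in place of the tar or sed command.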
Answered 2013-06-11T14:08:43.863