python - Error after a system call to zgrep from Python, only for large files

Question

I am using a Python script to make a system call to zgrep, printing only the first result by using the -m1 option.

Script:

#! /usr/bin/env python2.7

import subprocess

print subprocess.check_output("zgrep -m1 'a' test.txt.gz", shell=True)

Error:

Running the script on a large file (+2MB) produces the following error.

> ./broken-zgrep.py

gzip: stdout: Broken pipe
Traceback (most recent call last):
  File "./broken-zgrep.py", line 25, in <module>
    print subprocess.check_output("zgrep -m1 'a' test.txt.gz", shell=True)
  File "/usr/intel/pkgs/python/2.7/lib/python2.7/subprocess.py", line 537, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'zgrep -m1 'a' test.txt.gz' returned non-zero exit status 2

However, if I copy the command that Python complains about and run it directly in the shell, it works fine.

> zgrep -m1 'a' test.txt.gz
0000000 8c82 524d 67a4 c37d 0595 a457 b110 3192

After running the command manually in the shell, its exit status is 0, indicating success. Python says the command exited with error code 2.

> echo $?
0

Here is how to make a sample test file that reproduces the error. It creates a 100000-line file of random hex values and compresses it with gzip.

cat /dev/urandom | hexdump | head -n 100000 | gzip > test.txt.gz

Seemingly unrelated changes that prevent the error:

Making a smaller test file:

cat /dev/urandom | hexdump | head -n 100 | gzip > test.txt.gz

Running without the -m1 option (warning: this will spam the terminal):

print subprocess.check_output("zgrep 'a' test.txt.gz", shell=True)

Using grep instead of zgrep on an uncompressed file:

cat /dev/urandom | hexdump | head -n 100000 > test.txt

print subprocess.check_output("grep -m1 'a' test.txt", shell=True)

Running the equivalent command in perl:

perl -e 'print `zgrep -m1 'a' test.txt.gz`'

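I have no idea why the combination of Python, zgrep, the -m option and large files produces this error. If any one of these factors is removed, there is no error.

My best guess at the cause comes from reading the grep man page entry for the -m option.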

   -m NUM, --max-count=NUM
          Stop reading a file after NUM matching lines.  If the  input  is
          standard  input  from a regular file, and NUM matching lines are
          output, grep ensures that the standard input  is  positioned  to
          just  after the last matching line before exiting, regardless of
          the presence of trailing context lines.  This enables a  calling
          process  to resume a search.  When grep stops after NUM matching
          lines, it outputs any trailing context lines.

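I originally assumed that the -m option would simply cause grep to exit after finding NUM matches. But maybe something funny is going on with grep and its standard input. That still doesn't explain why the error only occurs with large compressed files.

I ended up porting my script from Python to perl to work around the problem, so there is no pressing need for a solution. But I would really like to understand why this perfect storm of circumstances fails.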

score 4 · Accepted Answer

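zgrep is just a shell script that is roughly equivalent to gunzip -c test.txt.gz | grep -m1 'a'. gunzip simply extracts chunks and passes them to grep. Then, when grep finds the pattern, it exits.

If gunzip has not finished decompressing the file by then, any further write to gunzip's stdout (which is connected to grep's stdin) will fail. With a small file, gunzip can typically push all of its output into the pipe buffer and exit before grep does, so no write ever fails. In your case the file is large enough that the write does fail: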

gzip: stdout: Broken pipe
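
The answer above explains why the write fails, but not why the failure only becomes an error under Python. Python sets SIGPIPE to "ignore" at startup, and child processes inherit that setting, so the writer gets a write error instead of being quietly killed by the signal. The sketch below is an illustration of that difference, not the asker's code: it is written for Python 3 (the question uses 2.7), it assumes bash is available, and it uses `yes | head -n 1` as a stand-in for `gunzip -c | grep -m1` because `yes` never finishes on its own. The names `PIPELINE` and `writer_status` are made up for this example.

```python
import signal
import subprocess

# The writer ("yes") never stops by itself, so the reader ("head") always
# exits first. PIPESTATUS[0] is a bash feature holding the exit status of
# the first command of the most recent pipeline.
PIPELINE = "yes | head -n 1 > /dev/null; echo ${PIPESTATUS[0]}"


def writer_status(ignore_sigpipe):
    """Return the exit status of `yes` with SIGPIPE ignored or at its default."""
    disposition = signal.SIG_IGN if ignore_sigpipe else signal.SIG_DFL
    result = subprocess.run(
        ["bash", "-c", PIPELINE],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the writer's "Broken pipe" message
        restore_signals=False,      # stop Python 3 from resetting SIGPIPE itself
        # Runs in the child just before exec and sets the disposition we want.
        preexec_fn=lambda: signal.signal(signal.SIGPIPE, disposition),
        universal_newlines=True,
    )
    return int(result.stdout.strip())
```

Calling `writer_status(False)` should report that the writer was killed by SIGPIPE, which bash shows as 128 plus the signal number. Calling `writer_status(True)` should instead report an ordinary non-zero error exit, the same kind of failure that zgrep passes back to `check_output` in the question. Note that Python 3's `subprocess` restores SIGPIPE to its default in the child unless `restore_signals=False` is passed, which is why the sketch has to switch that off to mimic Python 2.7.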

score 2 · Accepted Answer

Thanks to MilesF, this article explains it perfectly: https://blog.nelhage.com/2010/02/a-very-subtle-bug/

The Python code should be changed to this:

import subprocess
import signal

print subprocess.check_output("zgrep -m1 'a' test.txt.gz", shell=True, preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL))
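
A different way around the problem, which neither answer proposes, is to skip the shell pipeline and read the compressed file with Python's own `gzip` module, stopping at the first match. With no pipe between two processes, there is nothing for SIGPIPE to act on. This is a minimal Python 3 sketch under that assumption. The function name `first_matching_line` is hypothetical, and the pattern is treated as a Python regular expression, which is not identical to grep's syntax.

```python
import gzip
import re


def first_matching_line(path, pattern):
    """Return the first line of a gzip file that matches pattern, or None."""
    regex = re.compile(pattern)
    # Text mode decompresses lazily, so reading stops soon after the match.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if regex.search(line):
                return line.rstrip("\n")
    return None
```

For the file from the question, `first_matching_line("test.txt.gz", "a")` plays the role of `zgrep -m1 'a' test.txt.gz`, returning the line as a string instead of printing it.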
