python - Text processing - Python vs Perl performance

Question

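Here are my Perl and Python scripts for some simple text processing on about 21 log files, each roughly 300 KB to 1 MB (maximum), repeated 5 times (125 files in total, since the logs are repeated 5 times).

Python code (modified to use a compiled re and re.I)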

#!/usr/bin/python

import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()

    mprev = exists_re.search(currline)

    if(mprev):
        xlogtime = mprev.group(1)

    mcurr = location_re.search(currline)

    if(mcurr):
        print fn, xlogtime, mcurr.group(1)

Perl code

#!/usr/bin/perl

while (<>) {
    chomp;

    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }

    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}

On my PC, both scripts generate exactly the same result file of 10,790 lines. Here are the timings for Cygwin's Perl and Python implementations.

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s

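Originally, this simple text processing took 10.2 seconds in Python and only 1.9 seconds in Perl.

(UPDATE) After switching to the compiled re version, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

Is there any way to improve the speed of Python at all, or is it simply obvious that Perl will be the fast one for simple text processing?

By the way, this was not the only test I did for simple text processing. However I wrote the source code, Perl always won by a large margin. Not once did Python perform better for simple m/regex/ match-and-print tasks.

Please do not suggest using C, C++, Assembly, other flavours of Python, etc.

I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks because of its readability, but giving up this much speed? I don't think so.

So, please suggest how the code can be improved to get results comparable with Perl.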

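UPDATE: 2012-10-18

As other users suggested, Perl has its place and Python has its place.

So, for this question, one can safely conclude that for simple regex matching on each line of hundreds or thousands of text files, with the results written to a file (or printed to the screen), Perl will always, always win on performance. It is as simple as that.

Please note that when I say Perl wins on performance, I am comparing only standard Perl and standard Python, without resorting to obscure modules (obscure for a normal user like me) and without calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all these extra steps and installations for a simple text matching job.

So, Perl rocks for text processing and regex.

Python has its own places to rock too.

Update 2013-05-29: There is an excellent article that does a similar comparison here. Perl again wins for simple text matching. For more details, read the article.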

score 18 · Accepted Answer

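This is exactly the sort of thing Perl was designed to do, so it doesn't surprise me that it's faster.

One easy optimization in your Python code would be to precompile those regexes, so they aren't recompiled each time.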

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

And then in your loop:

mprev = exists_re.search(currline)

and

mcurr = location_re.search(currline)

That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.
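To check how much precompiling matters on a given machine, a rough timing sketch along these lines may help. It assumes Python 3; the sample line `SAMPLE` and the helper names `count_with_module_call` and `count_with_compiled` are made up for illustration and are not part of the question. Note that the `re` module keeps an internal cache of recently compiled patterns, so the difference is usually the per-call cache lookup rather than a full recompilation.

```python
import re
import timeit

PATTERN = r'^AwbLocation (.*?) insert into'
# Hypothetical sample line, made up for illustration only.
SAMPLE = 'AwbLocation 12345 insert into some_table values (...)'
LINES = [SAMPLE, 'unrelated line'] * 1000

location_re = re.compile(PATTERN, re.I)

def count_with_module_call(lines):
    """Go through the re module (and its pattern cache) on every line."""
    return sum(1 for line in lines if re.search(PATTERN, line, re.I))

def count_with_compiled(lines):
    """Reuse one precompiled pattern object for every line."""
    search = location_re.search
    return sum(1 for line in lines if search(line))

if __name__ == '__main__':
    for func in (count_with_module_call, count_with_compiled):
        seconds = timeit.timeit(lambda: func(LINES), number=50)
        print(func.__name__, round(seconds, 3))
```

Both functions count the same matches; only the elapsed time should differ, and the size of that difference depends on the interpreter and the data.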

score 14 · Answer

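Hypothesis: Perl spends less time backtracking on lines that don't match, thanks to optimisations it has that Python doesn't.

What do you get by replacing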

^(.*?) INFO.*Such a record already exists

with

^((?:(?! INFO).)*?) INFO.*Such a record already exists

or

^(?>(.*?) INFO).*Such a record already exists
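A small sketch for trying these alternatives from Python follows. It assumes Python 3; the log lines `HIT` and `MISS` and the helper `first_group` are invented for illustration. One thing to be aware of: the atomic group syntax `(?>...)` is only accepted by Python's `re` module from version 3.11 onward, so the sketch skips that pattern on older interpreters.

```python
import re
import sys

ORIGINAL = r'^(.*?) INFO.*Such a record already exists'
TEMPERED = r'^((?:(?! INFO).)*?) INFO.*Such a record already exists'
ATOMIC = r'^(?>(.*?) INFO).*Such a record already exists'

# Hypothetical log lines, made up for illustration only.
HIT = '2012-10-18 10:00:00 INFO db: Such a record already exists'
MISS = '2012-10-18 10:00:01 INFO db: record inserted'

def first_group(pattern, line):
    """Return group 1 of a case-insensitive search, or None if no match."""
    m = re.search(pattern, line, re.I)
    return m.group(1) if m else None

patterns = [ORIGINAL, TEMPERED]
if sys.version_info >= (3, 11):
    # Atomic groups are only supported by re from Python 3.11 on.
    patterns.append(ATOMIC)

for p in patterns:
    print(first_group(p, HIT), first_group(p, MISS))
```

If the variants capture the same text on real log lines, they can be swapped into the script and timed to test the backtracking hypothesis.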

score 4 · Answer

In Python, function calls are somewhat expensive in terms of time. Yet you have a loop-invariant function call inside the loop to get the file name:

fn = fileinput.filename()

Move this line above the for loop, and you should see some improvement in your Python timing. Probably not enough to beat Perl, though.
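One caveat worth checking before hoisting the call: when several files are passed on the command line, `fileinput.filename()` returns the name of the file currently being read (and `None` before the first line has been read), so its value changes during the loop. A hedged alternative, sketched below for Python 3, keeps the output correct and only asks for the file name when a line actually matches. The function name `scan` and the empty-string default for `xlogtime` are assumptions made for illustration.

```python
import fileinput
import re

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

def scan(paths):
    """Yield (filename, xlogtime, location) for every AwbLocation line."""
    xlogtime = ''  # assumption: empty until the first "already exists" line
    with fileinput.input(files=paths) as stream:
        for line in stream:
            mprev = exists_re.search(line)
            if mprev:
                xlogtime = mprev.group(1)
            mcurr = location_re.search(line)
            if mcurr:
                # Ask for the file name only on matching lines.
                yield stream.filename(), xlogtime, mcurr.group(1)
```

A caller could print the results with something like `for fn, logtime, location in scan(sys.argv[1:]): print(fn, logtime, location)`.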

score 1 · Answer

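Generally, all artificial benchmarks are evil. However, everything else being equal (the algorithmic approach), you can make improvements on a relative basis. It should be noted, though, that I don't use Perl, so I can't argue in its favor. That being said, with Python you can try using Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code to C++ via ShedSkin (which works for most of the core language and for some, but not all, of the core modules).

You can, however, follow some of the tips posted here: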

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

score 1 · Answer

I expect Perl to be faster. Just out of curiosity, can you try the following?

#!/usr/bin/python

import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()

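Update, in reaction to "it is too complex".

Of course it looks more complex than the Perl version. Perl was built around regular expressions, so you can hardly find an interpreted language that is faster at them. The Perl syntax...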

while (<>) {
    ...
}

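...also hides a lot of things that have to be done explicitly in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out: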

#!/usr/bin/python

import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:        # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)

            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)

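Here the def input_files() could be placed elsewhere (say, in another module), or it can be reused. It is even possible to mimic Perl's while (<>) {...} easily, though not with the same syntax: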

#!/usr/bin/python

import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f: # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)

    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)

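Then the last for may look as simple (in principle) as Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.

Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But, in my opinion, Python is a better programming language for more general purposes.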
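For readers on Python 3, where `print` is a function, the last variant might be ported roughly as follows. This is a sketch rather than the answer's original code: it assumes Python 3, takes the masks as a parameter, initialises `xlogtime` to an empty string so that an `AwbLocation` line appearing before any "already exists" line does not raise `NameError`, and wraps the loop in a hypothetical `main` function with an optional `out` stream.

```python
import glob
import os
import re
import sys

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

def input_lines(masks):
    """Yield (filename, line) for each line of each file matching the masks."""
    for mask in masks:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f:  # closed automatically
                    for line in f:
                        yield fname, line

def main(masks, out=None):
    """Print 'filename xlogtime location' for every AwbLocation line."""
    if out is None:
        out = sys.stdout
    xlogtime = ''  # assumption: empty until the first "already exists" line
    for fname, line in input_lines(masks):
        mex = exists_re.search(line)
        if mex:
            xlogtime = mex.group(1)
        mloc = location_re.search(line)
        if mloc:
            print(fname, xlogtime, mloc.group(1), file=out)

if __name__ == '__main__':
    main(sys.argv[1:])
```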
