18

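I wrote the following script to concatenate all the files in a directory into a single file.

Can it be optimized in terms of

  1. Idiomatic Python

  2. Time

Here is the snippet: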

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

6 Answers

40

Use shutil.copyfileobj to copy the data:

import glob
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

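shutil.copyfileobj reads from the readfile object in chunks and writes them directly to the outfile file object. Don't use readline() or iterate over the file line by line, since you don't need the overhead of finding line endings.

Use the same mode for both reading and writing; this is especially important in Python 3. I've used binary mode for both here.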
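
A minimal self-contained sketch of the same approach, wrapped in a function. The name concat_files, the pattern argument, the sorted order, and the optional separator are assumptions for illustration and not part of the snippet above; the separator is one way to restore the blank line that the question's script wrote between files.

```python
import glob
import os
import shutil


def concat_files(outfilename, pattern="*.txt", separator=b""):
    """Concatenate files matching `pattern` into `outfilename` (hypothetical helper)."""
    out_path = os.path.abspath(outfilename)
    with open(outfilename, "wb") as outfile:
        # sorted() only makes the order deterministic; glob itself gives no ordering guarantee
        for filename in sorted(glob.glob(pattern)):
            if os.path.abspath(filename) == out_path:
                # don't copy the output into the output
                continue
            with open(filename, "rb") as readfile:
                # copies chunk by chunk, so a whole input file is never held in memory
                shutil.copyfileobj(readfile, outfile)
            # optional bytes written after each input file (empty by default)
            outfile.write(separator)
```

Calling concat_files(outfilename, separator=b"\n\n") should behave like the question's script, minus the character-by-character loop.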

answered 2013-07-19T15:11:50.770
2

You can iterate over the lines of a file object directly, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
answered 2013-07-19T15:11:24.067
2

There's no need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
answered 2013-07-19T15:15:03.267
2

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

versus

shutil.copyfileobj(readfile, outfile)

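I iterated over more than 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal (text) read mode outperformed binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

Comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: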

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The final size of the outfile was 2620 MB in normal read mode versus 2578 MB in binary read mode.
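
For anyone who wants to repeat this kind of comparison, here is a rough timing sketch. It is not the code that produced the numbers above; the helper names (copy_with_read, copy_with_shutil, time_copy) and the repeats parameter are hypothetical, and results will vary with file sizes, disk, OS caching, and Python version.

```python
import shutil
import time


def copy_with_read(filenames, outfilename, binary=False):
    """Concatenate by reading each file fully, then writing it out."""
    rmode, wmode = ("rb", "wb") if binary else ("r", "w")
    with open(outfilename, wmode) as outfile:
        for fname in filenames:
            with open(fname, rmode) as infile:
                outfile.write(infile.read())


def copy_with_shutil(filenames, outfilename, binary=False):
    """Concatenate by streaming each file in chunks with shutil.copyfileobj."""
    rmode, wmode = ("rb", "wb") if binary else ("r", "w")
    with open(outfilename, wmode) as outfile:
        for fname in filenames:
            with open(fname, rmode) as infile:
                shutil.copyfileobj(infile, outfile)


def time_copy(func, filenames, outfilename, binary=False, repeats=3):
    """Return the average wall-clock seconds over `repeats` runs of `func`."""
    timings = []
    for _ in range(repeats):
        start = time.time()
        func(filenames, outfilename, binary=binary)
        timings.append(time.time() - start)
    return sum(timings) / len(timings)
```

Something like time_copy(copy_with_shutil, filenames, "out.txt", binary=True) would then give one cell of the comparison; running each combination several times helps smooth out disk-cache effects.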

answered 2015-10-27T10:57:21.420
1

The fileinput module provides a natural way to iterate over multiple files:

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
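
The two lines above assume that outfile is already open and that fileinput and glob are imported. A fuller sketch under those assumptions might look like the following; the function name concat_lines and its pattern argument are hypothetical, and the self-exclusion check is borrowed from the shutil answer rather than from this one.

```python
import fileinput
import glob
import os


def concat_lines(outfilename, pattern="*.txt"):
    """Write every line of every matching file to `outfilename` (hypothetical helper)."""
    out_path = os.path.abspath(outfilename)
    # Leave the output file out of the inputs in case it matches the pattern.
    filenames = [f for f in sorted(glob.glob(pattern))
                 if os.path.abspath(f) != out_path]
    with open(outfilename, "w") as outfile:
        if not filenames:
            # fileinput.input() with an empty list falls back to reading stdin,
            # so stop here and leave an empty output file
            return
        for line in fileinput.input(filenames):
            outfile.write(line)
        # release fileinput's module-level state
        fileinput.close()
```

Because this goes line by line in text mode, it is easy to filter or transform lines along the way, at the cost of the line-splitting overhead that the shutil answer avoids.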
answered 2013-07-19T15:15:17.613
1

I was curious to look further into performance, so I used the answers by Martijn Pieters and Stephen Miller.

I tried binary and text modes, each with and without shutil, merging 270 files.

Text mode -

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

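It looks like shutil performs about the same in both modes, while the plain read()/write() version is faster in text mode than in binary mode.

OS: Mac OS 10.14 Mojave. MacBook Air 2017.

answered 2019-03-22T08:37:13.610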