python - How to overcome memory issues when sequentially appending files to one another

Question

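I'm running the following script to append files to one another by looping through months and years (whenever the file exists). I've just tested it with a larger dataset, where I'd expect the output file to be about 600 MB, and I ran into memory issues. First, is it normal to hit memory issues here (my PC has 8 GB of RAM)? I'm not sure how I'm eating up all that memory.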

The code I'm running:

import datetime,  os
import StringIO

stored_data = StringIO.StringIO()

start_year = "2011"
start_month = "November"
first_run = False

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
    if os.path.exists(csv_filename):
        with open(csv_filename, 'rb') as current_csv:
            if first_run != False:
                next(current_csv)
            else:
                first_run = True
            stored_data.writelines(current_csv)
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
    contents = stored_data.getvalue()
    with open('FullMergedData.csv', 'wb') as output_csv:
        output_csv.write(contents)

The traceback I receive:

Traceback (most recent call last):
  File "C:\code snippets\FullMerger.py", line 23, in <module>
    contents = stored_output.getvalue()
  File "C:\Python27\lib\StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
MemoryError

Any ideas on how to fix this, or how to make this code more efficient so it avoids the problem? Many thanks,
AEA

Edit 1

After running the code provided by alKid, I received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 22, in <module>
    output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'

I fixed the above by changing it to writelines, but I still get the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 19, in <module>
    next(current_csv)
StopIteration

score 5 · Accepted Answer

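In stored_data you are trying to hold the entire contents of every file, and since that is too large, you get the error shown.

One solution is to write to the output file line by line. That is far more memory efficient, because you only hold one line of data in the buffer at a time rather than the whole 600 MB.

In short, the structure can look like this: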

# 'ab' appends to the output file in binary mode, matching the 'rb' input
with open('FullMergedData.csv', 'ab') as output_csv:
    with open(csv_filename, 'rb') as current_csv:
        if first_run != False:
            # an earlier file already supplied the header, so skip this one;
            # the None default avoids StopIteration on an empty file
            next(current_csv, None)
        else:
            first_run = True  # the first file keeps its header
        for line in current_csv:    # loop through the remaining lines
            output_csv.write(line)  # write one line at a time

That should solve your problem. Hope this helps!
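Put together with a loop over several input files, the line-by-line approach could be sketched roughly as follows. The function name merge_line_by_line and its parameters are made up for illustration, and the sketch assumes every CSV starts with the same single header line.

```python
import os


def merge_line_by_line(csv_filenames, output_filename):
    """Append each existing CSV to output_filename, keeping only the first header."""
    header_written = False
    # 'wb' starts a fresh output file; binary mode avoids newline translation
    with open(output_filename, 'wb') as output_csv:
        for csv_filename in csv_filenames:
            if not os.path.exists(csv_filename):
                continue  # skip months that have no file
            with open(csv_filename, 'rb') as current_csv:
                if header_written:
                    # drop the repeated header; the None default avoids
                    # StopIteration when a file is empty
                    next(current_csv, None)
                for line in current_csv:
                    output_csv.write(line)  # only one line is held at a time
                    header_written = True
```

The list of filenames could be built with the same month loop as in the question, for example by collecting possible_month.strftime('%B %Y') + ' MRG.csv' for each month before calling the function.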

score 3 · Accepted Answer

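Your memory error occurs because you store all of the data in a buffer before writing it out. Consider using something like copyfileobj to copy directly from one open file object to another; it only buffers a small amount of data at a time. You could also do it line by line, which would have much the same effect.

Update

Using copyfileobj should be much faster than writing the file line by line. Here is an example of how to use copyfileobj. This code opens both files, skips the first line of the input file if skip_first_line is True, and then copies the rest of that file to the output file.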

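import shutil

skip_first_line = True

# 'ab' appends in binary mode, matching the 'rb' input
with open('FullMergedData.csv', 'ab') as output_csv:
    with open(csv_filename, 'rb') as current_csv:
        if skip_first_line:
            current_csv.readline()  # discard the header line
        shutil.copyfileobj(current_csv, output_csv)  # copy the rest in chunks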

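Note that if you are using copyfileobj, you will want to use current_csv.readline() rather than next(current_csv). This is because iterating over a file object buffers part of the file, which is normally very useful, but you don't want that here: in Python 2, the data already pulled into that read-ahead buffer would be missed by the read() calls that copyfileobj makes. More on that here.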
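For the multi-file case in the question, the same idea might be wrapped up along these lines. This is only a sketch: merge_with_copyfileobj and its parameters are hypothetical names, and it assumes each CSV has one header line and ends with a newline (otherwise the first row of the next file would be glued onto the last row of the previous one).

```python
import os
import shutil


def merge_with_copyfileobj(csv_filenames, output_filename):
    """Concatenate CSVs in chunks, keeping only the first file's header."""
    skip_first_line = False
    # 'wb' starts a fresh output file in binary mode
    with open(output_filename, 'wb') as output_csv:
        for csv_filename in csv_filenames:
            if not os.path.exists(csv_filename):
                continue  # skip months that have no file
            with open(csv_filename, 'rb') as current_csv:
                if skip_first_line:
                    # readline() rather than next(), so no iteration
                    # read-ahead buffer is involved
                    current_csv.readline()
                # copies in fixed-size chunks, never the whole file at once
                shutil.copyfileobj(current_csv, output_csv)
            # every file after the first existing one loses its header
            skip_first_line = True
```

Because nothing is accumulated in a StringIO buffer, memory use should stay roughly constant regardless of how large the merged output grows.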
