python - Loading a 2GB text file into memory in Python

Question

In Python 2.7, when I load all the data from a 2.5GB text file into memory for faster processing, like this:

>>> f = open('dump.xml','r')
>>> dump = f.read()

I get the following error:

Python(62813) malloc: *** mmap(size=140521659486208) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

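Why did Python try to allocate 140521659486208 bytes of memory for 2563749237 bytes of data? How can I fix the code so that it loads all the bytes?

I have about 3GB of free RAM. The file is a Wiktionary XML dump.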

score 13 · Accepted Answer

If you use mmap, you can map the entire file into memory at once. The operating system pages the data in on demand, so no single 2.5GB buffer has to be allocated up front.

import mmap

with open('dump.xml', 'rb') as f:
  # Length 0 maps the ENTIRE file
  m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # Mapping is read-only (prot is Unix-only)

  # Proceed with your code here. The OS pages the file in on demand,
  # so "readline" here will be about as fast as it can be
  data = m.readline()
  while data:
    # Do stuff
    data = m.readline()
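
As a rough illustration of working with the mapping, here is a minimal Python 3 sketch that counts lines and searches for a tag without calling read(). The helper name count_lines_and_find and the tiny temporary file standing in for dump.xml are assumptions made for this example, and access=mmap.ACCESS_READ is used in place of prot because it also works on Windows.

```python
import mmap
import os
import tempfile


def count_lines_and_find(path, needle):
    """Return (line_count, byte offset of the first needle or -1).

    Hypothetical helper for illustration; it uses a read-only mmap.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ is the portable read-only flag.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Search first, while the mapping's position is still at offset 0.
            offset = m.find(needle)
            line_count = 0
            # iter(callable, sentinel) stops when readline() returns b"" at EOF.
            for _ in iter(m.readline, b""):
                line_count += 1
            return line_count, offset


# Tiny stand-in for dump.xml so the sketch can be run anywhere.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<page>\n<title>apple</title>\n</page>\n")
    sample_path = tmp.name

result = count_lines_and_find(sample_path, b"<title>")
os.remove(sample_path)
print(result)
```

With the real dump, the same function would be called as count_lines_and_find('dump.xml', b'<title>'). Only the pages that are actually touched need to be resident at any moment.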

score -1 · Answer

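After some quick googling, I found this forum post, which seems to address the problem you are having. Assuming from the error code that you are running Mac or Linux, you could try forcing garbage collection with gc.enable() or gc.collect(), as the forum post suggests.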
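
For reference, a minimal sketch of what those two calls do. The Node class is hypothetical and exists only to create a reference cycle. Note that the collector reclaims unreachable objects only, so it would not be expected to shrink a single large string that is still in use.

```python
import gc


class Node(object):
    """Hypothetical object used only to build a reference cycle."""

    def __init__(self):
        self.other = None


gc.enable()    # cyclic collection is on by default; this makes it explicit
gc.collect()   # start from a clean state

a, b = Node(), Node()
a.other, b.other = b, a   # a and b now reference each other
del a, b                  # unreachable, but the cycle keeps both refcounts above zero

# Force a full collection; the return value is the number of
# unreachable objects the collector found.
unreachable = gc.collect()
print(unreachable)
```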
