python - Reading a 6.9GB file causes a segmentation fault

Question

I'm trying to open the latest Japanese Wikipedia database for reading in Python 3.3.1 on Linux, but this short program fails with a Segmentation fault (core dumped) error:

with open("jawiki-latest-pages-articles.xml") as f:
    text = f.read()

The file itself is quite large:

-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml

So there seems to be an upper limit on how long a string I can store. What is the best way to handle this situation?

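My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)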

score 11 · Accepted Answer

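Don't read the whole thing at once. Even if your Python distribution is compiled as a 64-bit program (it is simply impossible to allocate more than 4 GB of virtual memory in a 32-bit program), and even if you have enough RAM to store it all, it is still a bad idea to read everything into memory at once.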
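To see which of those cases applies on a given machine, the interpreter's word size can be compared with the file size. This is only a minimal sketch: the helper names `is_64bit_python` and `fits_in_address_space` are made up for illustration, and the check is deliberately rough.

```python
import os
import sys


def is_64bit_python():
    """Return True when the running interpreter is a 64-bit build."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2**32


def fits_in_address_space(path):
    """Rough check: is the file smaller than the largest possible object?

    This ignores the extra memory needed to decode the bytes into a str,
    so a True result does not mean reading the whole file is a good idea.
    """
    return os.path.getsize(path) < sys.maxsize
```

Even when both checks pass, the advice above still holds: process the file incrementally.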

One simple option is to read it a line at a time and process each line:

with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        pass  # Process one line here

Alternatively, you can process it in fixed-size chunks:

with open("jawiki-latest-pages-articles.xml") as f:
    while True:
        data = f.read(65536)  # Or any other reasonably sized chunk
        if not data:
            break  # An empty result means end of file
        # Process one chunk of data. Make sure to handle data that spans
        # a chunk boundary properly, and make sure to handle EOF properly.
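For the character-counting goal in the question, the chunked approach might look like the sketch below. The function name `count_chars_in_chunks` is hypothetical, and `encoding="utf-8"` is an assumption about the dump file. Because the file is opened in text mode, `read(n)` returns decoded characters rather than raw bytes, so a multi-byte character cannot be cut in half at a chunk boundary; that concern only arises when reading in binary mode.

```python
from collections import Counter


def count_chars_in_chunks(path, chunk_size=65536, encoding="utf-8"):
    """Count characters in a text file without loading it all into memory."""
    counter = Counter()
    # encoding="utf-8" is an assumption; adjust it to match the actual file.
    with open(path, encoding=encoding) as f:
        while True:
            # In text mode, read(n) returns at most n decoded characters.
            data = f.read(chunk_size)
            if not data:  # An empty string means end of file.
                break
            counter.update(data)
    return counter
```

Counting single characters needs no state carried across chunks. Anything that matches longer patterns (words, XML tags) would have to keep the tail of one chunk and prepend it to the next.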

score 0 · Accepted Answer

Here is the program I ended up using, in case anyone is curious.

from collections import Counter

counter = Counter()

with open("jawiki-latest-pages-articles.xml") as f:
    # Count every character, one line at a time.
    for progress, line in enumerate(f, start=1):
        counter.update(line)
        if progress % 10000 == 0:
            print("Processing line {0}..., number {1}".format(line[:10], progress))

# Write one "character<TAB>count" pair per line.
with open("output.txt", "w") as output:
    for k, v in counter.items():
        print("{0}\t{1}".format(k, v), file=output)
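Since the stated goal is to find the most common characters, the counts could also be written in descending order with `Counter.most_common`. This is a sketch rather than part of the program above: the helper name `write_most_common` and the `limit` parameter are hypothetical, and writing the output as UTF-8 is an assumption.

```python
from collections import Counter


def write_most_common(counter, path, limit=None):
    """Write tab-separated "character<TAB>count" lines, most frequent first."""
    # encoding="utf-8" is an assumption so that kanji are written portably.
    with open(path, "w", encoding="utf-8") as output:
        # most_common(None) returns every entry, sorted by descending count.
        for char, count in counter.most_common(limit):
            print("{0}\t{1}".format(char, count), file=output)
```

For example, `write_most_common(counter, "output.txt", limit=100)` would keep only the 100 most frequent characters.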
