python - What is a good way to optimize a list of objects when the objects take more memory than the data itself in Python?

Question

What is a good way to optimize a list of objects when the objects take up more memory than the data they hold in Python?

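Suppose we have a list of 100M string objects (possibly from long_string.split('\t')). Each string object holds only a few bytes of string data, but the object itself takes up tens of bytes of memory. What good alternatives are there in Python?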
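To make the overhead concrete, here is a minimal sketch that compares the payload with the object footprint. It assumes CPython, where sys.getsizeof reports the size of the object itself; the sample string and the names payload and footprint are illustrative only.

```python
import sys

# A tiny stand-in for long_string.split('\t')
fields = "ab\tcd\tef".split("\t")

payload = sum(len(f) for f in fields)              # characters of actual data
footprint = sum(sys.getsizeof(f) for f in fields)  # bytes used by the str objects

print(payload, footprint)
```

The exact numbers depend on the interpreter version and platform, and the list's own pointer array adds to the total.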

score 3 · Accepted Answer

A good approach may be to avoid holding them all in memory at once, for example by using a generator to produce the objects on demand.
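A minimal sketch of that idea, assuming the data arrives as tab-separated lines. The helper name iter_fields is hypothetical, and io.StringIO stands in for a real file object.

```python
import io

def iter_fields(lines, sep="\t"):
    """Yield one field at a time; only the current line's fields are in memory."""
    for line in lines:
        for field in line.rstrip("\n").split(sep):
            yield field

sample = io.StringIO("a\tbb\nccc\td\n")  # stands in for a real file object
longest = max(iter_fields(sample), key=len)
print(longest)
```

Because the consumer (max here) pulls fields one by one, the full list of fields is never built.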

score 0 · Accepted Answer

Process a large multi-line file one line at a time:

def main():
    with open('input_file.txt') as file:
        for line in file:
            process_line(line)

If the file contains numbers (for example, one short integer per line) and you need all of them at once, you can use a numpy array:

from functools import partial
import numpy as np

def count_lines(file):
    """Return number of lines in the file (opened in binary mode)."""
    # Read 32 KiB chunks until read() returns b'' at end of file.
    return sum(chunk.count(b'\n') for chunk in iter(partial(file.read, 1 << 15), b''))

with open('input_file.txt', 'rb') as file:
    nlines = count_lines(file)  # count lines to avoid overallocation in fromiter
    file.seek(0)  # rewind
    # Python 3: the built-in map is lazy and replaces itertools.imap.
    a = np.fromiter(map(int, file), dtype=np.int16, count=nlines)
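If numpy is not available, the standard-library array module packs numbers in a similarly compact way. This is a swapped-in alternative, not part of the answer above; the sketch assumes CPython and uses made-up values.

```python
import sys
from array import array

# 'h' is a C signed short (2 bytes on common platforms)
values = array("h", range(1000))
packed_bytes = values.itemsize * len(values)

# The same numbers as a list of int objects
as_list = list(values)
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

print(packed_bytes, list_bytes)
```

The list total is only an estimate, since CPython shares small int objects, but it shows the order of magnitude of the difference.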

score 0 · Accepted Answer

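I think the main problem is that you read the whole file into memory. If possible, you should read the file in pieces and process each piece as you go: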

with open('filename', 'r') as file_object:  # closes the file automatically
    while True:
        line = file_object.readline()
        if not line:  # empty string means end of file
            break
        process_line(line)
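For fixed-size chunks rather than lines, one possible sketch is below. The helper name iter_lines_chunked and the chunk size are assumptions for illustration, and io.StringIO stands in for an open text file.

```python
import io

def iter_lines_chunked(stream, chunk_size=1 << 15):
    """Yield lines (without the newline) from a text stream read in fixed-size chunks."""
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        lines = (tail + chunk).split("\n")
        tail = lines.pop()  # possibly incomplete last line, kept for the next chunk
        for line in lines:
            yield line
    if tail:
        yield tail  # final line when the stream does not end with a newline

demo = io.StringIO("alpha\nbeta\ngamma")
result = list(iter_lines_chunked(demo, chunk_size=4))
print(result)
```

Memory use is bounded by the chunk size plus the longest line, regardless of file size.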

score 0 · Accepted Answer

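def split_lines(text):
    """Yield the lines of text one at a time, without building a list."""
    temp = ''
    for char in text:
        if char != '\n':
            temp += char
        else:
            yield temp
            temp = ''
    if temp:
        yield temp  # last line when text does not end with a newline

for each in split_lines(text):
    process_line(each)  # process each line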

I checked and this works, although it seems to take longer than just using

for each in text.split('\n'):
    process_line(each)  # process each line

but it saves a lot of memory, because the text data contains billions of lines!
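Much of the slowdown likely comes from looping character by character in Python. A possible variant, sketched here under that assumption, lets str.find locate each newline instead. The name iter_lines is hypothetical, and the function is written to yield the same items as text.split('\n'), including a trailing empty string.

```python
def iter_lines(text, sep="\n"):
    """Lazily yield the same items as text.split(sep), one at a time."""
    start = 0
    while True:
        end = text.find(sep, start)  # the search runs in C, not in a Python loop
        if end == -1:
            yield text[start:]  # last piece (may be empty)
            return
        yield text[start:end]
        start = end + len(sep)

sample = "one\ntwo\n\nthree"
lines = list(iter_lines(sample))
print(lines)
```

Only one line is materialized at a time, so the memory saving is kept while the per-character overhead goes away.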
