python - Python memory leaks in large data structures (lists, dicts): what could be the reason?

Question

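The code is extremely simple. It should not have any leaks, since everything is done inside the function and nothing is returned. I have a function that goes over all the lines in a file (~20 MiB) and puts them all into a list.
The function in question: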

def read_art_file(filename, path_to_dir):
    import codecs
    corpus = []
    corpus_file = codecs.open(path_to_dir + filename, 'r', 'iso-8859-15')
    newline = corpus_file.readline().strip()
    while newline != '':
        # we put into @article a @newline of file and some other info
        # (i left those lists blank for readability)
        article = [newline, [], [], [], [], [], [], [], [], [], [], [], []]
        corpus.append(article)
        del newline
        del article
        newline = corpus_file.readline().strip()
    memory_usage('inside function')
    for article in corpus:
        for word in article:
            del word
        del article
    del corpus
    corpus_file.close()
    memory_usage('inside: after corp deleted')
    return

Here is the main code:

import time

memory_usage('START')
path_to_dir = '/home/soshial/internship/training_data/parser_output/'
read_art_file('accounting.n.txt.wpr.art', path_to_dir)
memory_usage('outside func')
time.sleep(5)
memory_usage('END')

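All memory_usage does is print the number of KiB of resident memory (RSS) used by the script.

Executing the script

If I run the script, it gives me:

START memory: 6088 KiB
inside memory: 393752 KiB (20 MiB file + lists occupy 400 MiB)
inside: after corp deleted memory: 43360 KiB
outside func memory: 34300 KiB (34300-6088 = 28 MiB leaked)
END memory: 34300 KiB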

Executing without the list

If I do the same thing, still building article but with the append to corpus commented out:

article = [newline, [], [], [], [], [], ...]  # we still assign data to `article`
# corpus.append(article)  # we don't have this string during second execution

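then the output is:

START memory: 6076 KiB
inside memory: 6076 KiB
inside: after corp deleted memory: 6076 KiB
outside func memory: 6076 KiB
END memory: 6076 KiB

Question:

So this way all the memory is released. I need all the memory to be released because I am going to process hundreds of such files.
Am I doing something wrong, or is this a CPython interpreter bug?

Update. This is how I check memory consumption (taken from another Stack Overflow question):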

def memory_usage(text = ''):
    """Memory usage of the current process in kilobytes."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB  ')
    return

score 8 · Accepted Answer

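Note that Python never guarantees that any memory your code uses will actually be returned to the OS. All that garbage collection guarantees is that the memory used by a collected object is free to be reused by another object at some future time.

From what I've read¹ about the CPython implementation of the memory allocator, memory is allocated in "pools" for efficiency. When a pool is full, Python allocates a new pool. If a pool contains only dead objects, CPython actually frees the memory associated with that pool; otherwise it doesn't. This can leave multiple partially full pools hanging around after a function call or similar. However, that does not really make it a "memory leak". (CPython still knows about the memory and could potentially free it at some later time.)

¹ I am not a Python developer, so these details are likely to be incorrect or at least incomplete.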
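One way to separate a genuine leak from memory that the allocator is merely holding on to is to ask the interpreter how much it considers live, rather than looking only at RSS. The following is a minimal sketch using the standard tracemalloc module; the function name build_and_drop and the list layout are made up for illustration and are not taken from the question's data. If the traced figure falls back after the del, the objects were freed, even when the RSS reported by /proc stays higher.

```python
import tracemalloc


def build_and_drop(n_lines=20000):
    """Build a corpus-like list, drop it, and report traced memory in bytes.

    Hypothetical helper for illustration; the article layout only mimics
    the nested lists from the question.
    """
    tracemalloc.start()
    corpus = [["line %d" % i, [], [], []] for i in range(n_lines)]
    held, _ = tracemalloc.get_traced_memory()    # bytes live while corpus exists
    del corpus                                   # last reference goes away
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return held, after, peak


if __name__ == "__main__":
    held, after, peak = build_and_drop()
    print("held:", held, "after del:", after, "peak:", peak)
```

The traced numbers only cover allocations made through Python's allocators, so they say nothing about how many bytes the process has handed back to the operating system; they only show whether Python still considers the objects alive.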

score 1 · Accepted Answer

This loop

for article in corpus:
    for word in article:
        del word
    del article

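does not free any memory. del word only decrements the reference count of the object that the name word refers to. However, your loop increments each object's reference count by one when the loop variable is bound to it. In other words, the loop causes no net change in any object's reference count.

When you comment out the call to corpus.append, you no longer keep any reference to the objects read from the file from one iteration to the next, so the interpreter can free that memory much earlier, which accounts for the drop in memory that you observe.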
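The point about reference counts can be checked directly. The sketch below assumes CPython, where sys.getrefcount exists and objects are freed as soon as their count reaches zero; the Article subclass and the function name demo_del_loop are hypothetical and exist only so that a weak reference can be taken to one of the lists.

```python
import sys
import weakref


class Article(list):
    """List subclass, used only because plain lists cannot be weakly referenced."""


def demo_del_loop(n=3):
    corpus = [Article(["line %d" % i, []]) for i in range(n)]
    probe = weakref.ref(corpus[0])          # does not keep the article alive
    counts_before = [sys.getrefcount(a) for a in corpus]

    # The loop from the question: each del only undoes the binding
    # that the for statement itself created.
    for article in corpus:
        for word in article:
            del word
        del article

    counts_after = [sys.getrefcount(a) for a in corpus]
    alive_after_loop = probe() is not None  # still referenced by corpus

    del corpus                              # this is what actually frees the articles
    alive_after_del = probe() is not None
    return counts_before == counts_after, alive_after_loop, alive_after_del


if __name__ == "__main__":
    print(demo_del_loop())
```

On CPython the three values should come out as unchanged counts, object still alive after the loop, and object gone after del corpus. Other interpreters may delay the final collection, so the last value is an assumption about CPython specifically.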
