python - 寻求惰性化 Python 数据的指南

Question

注意：我根据评论和答案编辑了原始问题。

我的问题是，如果将大量 Python 数据输入到程序中，如何使这些数据变得惰性，从而使内存不溢出？

例如，如果一个列表是通过读入一个文件并将每一行或每一行的一部分附加到一个列表中来构建的，那么这个列表是惰性的吗？换句话说，可以附加一个列表并且该列表是惰性的吗？是否附加到将整个文件读入内存的列表？

我知道如果我想浏览该列表，我会编写一个生成器函数来保持访问惰性。

触发这个问题的是最近的 SO 帖子

如果这些数据来自具有 10M 行的数据库表，例如我们的 MySQL 每日水表读取表之一，我不会在不知道如何使数据变得惰性的情况下使用 mysqldb fetchall() 命令。相反，我会一次读一行。

但是如果我确实希望内存中的数据内容作为惰性序列呢？我将如何在 Python 中做到这一点？

鉴于我没有提供具有特定问题的源代码，我正在寻找的答案是指向 Python 文档中某个位置或解决此问题的其他位置的一个或多个指针。

谢谢。

score 2 · Accepted Answer

Python 中用于延迟呈现序列的机制是生成器。

生成器 [原文如此] 函数允许您声明一个行为类似于迭代器的函数，即它可以在 for 循环中使用。

score 1 · Accepted Answer

A list is almost the opposite of lazy. The best example would be the difference between range and xrange; range creates a list, while xrange lazily gives you each number as you need it, using a generator.

>>> total = 0
>>> for i in range(2**30):
    total += i

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    for i in range(2**30):
MemoryError
>>> print total
0
>>> for i in xrange(2**30):
    total += i
>>> print total
576460751766552576

Many of the places that will take a list will also take a generator in its place. This is so true that Python 3 does away with xrange entirely, and uses it to replace the normal range.

>>> total2 = sum(xrange(2**30))
>>> print total2
576460751766552576

It's easy to make your own generator:

>>> def myrange(n):
        i = 0
        while i < n:
            yield i
            i += 1
>>> sum(xrange(10))
45
>>> sum(myrange(10))
45
>>> myrange(10)
<generator object myrange at 0x02A2DDA0>

And if you do really need a list, that's easy too. But then of course it's no longer lazy.

>>> list(myrange(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

score 1 · Accepted Answer

“惰性”代码的基本思想是代码在需要数据之前不会获取数据。

例如，假设我正在编写一个函数来复制一个文本文件。将整个文件读入内存，然后写入整个文件，这不会是偷懒。.readlines()使用该方法从所有输入行中构建一个列表也不会是懒惰的。但是一次读一行然后读完后写每一行会很懒惰。

# non-lazy
with open(input_fname) as in_f, open(output_fname, "w") as out_f:
    bytes = in_f.read()
    out_f.write(bytes)

# also non-lazy
with open(input_fname) as in_f, open(output_fname, "w") as out_f:
    lines = in_f.readlines()
    for line in lines:
        out_f.write(line)

# lazy
with open(input_fname) as in_f, open(output_fname, "w") as out_f:
    for line in in_f:  # only gets one line at a time
        out_f.write(line) # write each line as we get it

为了帮助您的代码变得懒惰，Python 允许您使用“生成器”。yield使用该语句编写的函数是生成器。对于您的数据库示例，您可以编写一个生成器，该生成器一次从数据库中生成一行，然后您可以编写如下代码：

def db_rows(database_name):
    # code to open the database goes here
    # write a loop that reads rows
        # inside the loop, use yield on each row
        yield row
    # code to close the database goes here

for row in db_rows(database_name):
    # do something with the row

score 0 · Accepted Answer

但是如果我确实希望内存中的数据内容作为惰性序列呢？

以下是创建惰性序列的方法：不是存储项目，而是根据请求动态生成它们，但将其隐藏在[]语法后面。我一直忘记SQL API是如何工作的，所以下面应该理解为伪代码。

class Table(object):
    def __init__(self, db_cursor):
        self._cursor = db_cursor

    def __getitem__(self, i):
        return self._cursor.fetch_row(i)

    def __iter__(self):
        for i in xrange(len(self)):
            yield self[i]

    def __len__(self):
        return self._cursor.number_of_rows()

这可以在许多可以list使用 a 但实际上不存储任何内容的情况下使用。根据需要添加缓存（取决于访问模式）。

score 0 · Accepted Answer

如果您只想要可以迭代的东西，我会研究生成器：

PEP 255包含很多相关信息。

取决于文件类型的另一个选项是linecache模块。

python - 寻求惰性化 Python 数据的指南

5 回答 5

Related

Reference