python - Is there a way to read 10000 lines from a file in Python?

Question

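I'm relatively new to Python and have worked a lot in C. Since I keep seeing functions in Python that I didn't know about, I was wondering whether there is a function that requests 10000 lines from a file in Python.

If such a function exists, I'd expect something like this: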

lines = get_10000_lines(file_pointer)

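Does Python have a built-in function for this, or is there a module I can download? If not, what is the simplest way to do it? I need to analyze a huge file, so I want to read 10000 lines at a time and analyze them to save memory.

Thanks for the help!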

score 24 · Accepted Answer

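f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

From the docs.

This isn't exactly what you asked for, since it limits the number of bytes read rather than the number of lines read, but I think it's what you want to do.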
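As a rough sketch of how the size hint could be used to walk a large file in batches (the function name iter_line_batches and the 64 KiB default are illustrative assumptions, not part of the quoted documentation):

```python
def iter_line_batches(path, sizehint=64 * 1024):
    """Yield lists of complete lines, each list holding roughly `sizehint` bytes."""
    with open(path) as f:
        while True:
            # readlines(hint) stops collecting lines once their total size
            # exceeds the hint, so a batch is bounded by size, not by a
            # line count.
            batch = f.readlines(sizehint)
            if not batch:  # an empty list means end of file
                break
            yield batch
```

Each batch can then be analyzed and discarded before the next one is read, so only about sizehint bytes of lines are held at a time.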

score 21 · Accepted Answer

from itertools import islice

with open(filename) as f:
    first10000 = islice(f, 10000)

This sets first10000 to a lazy iterable, so (while the file is still open, i.e. inside the with block) you can loop over it with

for x in first10000:
    do_something_with(x)

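If you need a list, use list(islice(f, 10000)) instead.

When the file contains fewer than 10k lines, this simply returns all the lines in the file, without padding (unlike the range-based solution). When reading a file in chunks, EOF is signaled by a result of <10000 lines: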

with open(filename) as f:
    while True:
        next10k = list(islice(f, 10000))  # need list to do len, 3 lines down
        for ln in next10k:
            process(ln)
        if len(next10k) < 10000:
            break

score 4 · Accepted Answer

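Do you really care how many lines you have at a time? It usually makes the most sense to simply iterate over the file object line by line: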

with open('myfile.txt', 'r') as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline

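The Python docs indicate that this is the preferred way to process files:

An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code.

See the Python docs for an example.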

score 3 · Accepted Answer

Just open the file and tell Python to read a line 10,000 times.

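with open('<filename>') as file:
    # Build a list, not a generator: a generator would only call readline()
    # after the with block has closed the file. Past EOF, readline() returns ''.
    lines = [file.readline() for _ in range(10000)]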

score 3 · Accepted Answer

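Are you sure the file is too big to hold in memory?

Since function calls have overhead (i.e. calling the same function 10000 times is slow) and memory is cheap, I'd suggest reading all the lines at once and then slicing the resulting list. If you want to process the next 10000 later, this is certainly the fastest way: they'll be ready for you immediately.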

with open("filename") as f:
    lines = f.readlines()

indices = list(range(0, len(lines), 10000)) + [len(lines)]  # list() is required on Python 3
for start, stop in zip(indices, indices[1:]):
    do_stuff_with(lines[start:stop])

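Of course, if the file doesn't fit in free memory, this won't work. If that's the case, I'd go with ChipJust's answer. You could even create a target-seeking function, using the readlines sizehint together with tell and seek, that would "home in" on exactly 10000 lines, if that's important.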

score 3 · Accepted Answer

with open('myfile.txt', 'rb') as f:  # binary mode, so each line is a bytes object
    while True:
        bytes_lines = f.readlines(10000)  # about 10000 bytes, rounded up to whole lines
        if not bytes_lines:
            break  # stop looping once no lines are read (EOF)
        for line in bytes_lines:
            text = line.decode("knownencoding")  # placeholder encoding name; text is a str

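It's faster to read a large chunk of text at once and then process it. This reads a large chunk of text and then splits it into lines for you, which saves on reads. It also gives you only complete lines, so you don't have to deal with stitching line fragments back together.

Please test this to make sure that reading at the end of the file doesn't raise an exception.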

score 3 · Accepted Answer

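There's no function that works exactly the way you want. You could write one easily enough, but you may be no better off. For example, if you get a list of lines, as in many of the solutions shown here, you then have to analyze each line individually: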

import itertools

def get_10000_lines(f):
    while True:
        chunk = list(itertools.islice(f, 10000))
        if not chunk:
            break
        yield chunk

If you're doing that, you might as well read the file one line at a time and analyze each string. File I/O is buffered anyway:

for line in f:
    analyze_the_line(line)

If you want a single string containing 10,000 lines, then you'd be reading each line individually and joining them together:

for chunk in get_10000_lines(f):
    str_10k = "".join(chunk)
    analyze_a_bunch(str_10k)

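Now you're doing a lot of work allocating and joining strings, which may not be worth it.

Best of all, if you can do your analysis on partial lines, you can just read the file in 1 MB chunks: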

while True:
    chunk = f.read(1000000)
    if not chunk:
        break
    analyze_a_bunch(chunk)
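As one illustration of an analysis that tolerates partial lines, counting newline characters works on arbitrary chunks because no line ever has to be reassembled. A minimal sketch (count_newlines and its chunk_size default are hypothetical names chosen for this example):

```python
def count_newlines(path, chunk_size=1000000):
    """Count newline bytes in a file by reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # b'' signals end of file
                break
            # A chunk may start or end mid-line; that does not matter here,
            # since each newline byte falls in exactly one chunk.
            total += chunk.count(b"\n")
    return total
```

An analysis that needs whole lines would instead have to carry the trailing fragment of each chunk over to the next one, which is the bookkeeping that readlines with a size hint does for you.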

score 3 · Accepted Answer

Taking inspiration from several of the other solutions, but adding a twist...

>>> import itertools
>>> with open('lines.txt', 'r') as lines:
...     chunks = iter(lambda: list(itertools.islice(lines, 7)), [])
...     for chunk in chunks:
...         print(chunk)
... 
['0\n', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n']
['7\n', '8\n', '9\n', '10\n', '11\n', '12\n', '13\n']
['14\n', '15\n', '16\n', '17\n', '18\n', '19\n', '20\n']
['21\n', '22\n', '23\n', '24\n', '25\n', '26\n', '27\n']
['28\n', '29\n', '30\n', '31\n', '32\n', '33\n', '34\n']
['35\n', '36\n', '37\n', '38\n', '39\n', '40\n', '41\n']
['42\n', '43\n', '44\n', '45\n', '46\n', '47\n', '48\n']
['49\n', '50\n', '51\n', '52\n', '53\n', '54\n', '55\n']
['56\n', '57\n', '58\n', '59\n', '60\n', '61\n', '62\n']
['63\n', '64\n', '65\n', '66\n', '67\n', '68\n', '69\n']
['70\n', '71\n', '72\n', '73\n', '74\n', '75\n', '76\n']
['77\n', '78\n', '79\n', '80\n', '81\n', '82\n', '83\n']
['84\n', '85\n', '86\n', '87\n', '88\n', '89\n', '90\n']
['91\n', '92\n', '93\n', '94\n', '95\n', '96\n', '97\n']
['98\n', '99\n']
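The twist is the two-argument form of iter: iter(callable, sentinel) calls the callable repeatedly and stops as soon as it returns a value equal to the sentinel, here the empty list that islice produces at end of file. A small self-contained sketch, using io.StringIO as a stand-in for a real file (the helper name chunked_lines is an assumption for illustration):

```python
import io
import itertools

def chunked_lines(lines, n):
    """Return an iterator over lists of up to n items taken from `lines`.

    `lines` must be an iterator (such as an open file), not a list,
    so that each islice call continues where the previous one stopped.
    """
    # iter(callable, sentinel) keeps calling the lambda until it returns [].
    return iter(lambda: list(itertools.islice(lines, n)), [])

fake_file = io.StringIO("a\nb\nc\nd\ne\n")
for chunk in chunked_lines(fake_file, 2):
    print(chunk)
# ['a\n', 'b\n']
# ['c\n', 'd\n']
# ['e\n']
```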

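But I have to admit here that, as others have said, using readlines with a byte hint is a bit faster, as long as you don't need exactly 10000 lines (or exactly 10000 lines every time). However, I'm not convinced that's because it does fewer reads. The docstring says that readlines "calls readline() repeatedly and returns a list of the lines so read." So I think the speed gain comes from shaving off a small amount of iterator overhead. Definitions (using Marcin's code):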

import itertools

def do_nothing_islice(filename, nlines):
    with open(filename, 'r') as lines:
        chunks = iter(lambda: list(itertools.islice(lines, nlines)), [])
        for chunk in chunks:
            chunk

def do_nothing_readlines(filename, nbytes):
    with open(filename, 'r') as lines:
        while True:
            bytes_lines = lines.readlines(nbytes)
            if not bytes_lines:
                break
            bytes_lines

Tests:

>>> %timeit do_nothing_islice('lines.txt', 1000)
10 loops, best of 3: 63.6 ms per loop
>>> %timeit do_nothing_readlines('lines.txt', 7000) # 7-byte lines, ish
10 loops, best of 3: 56.8 ms per loop
>>> %timeit do_nothing_islice('lines.txt', 10000)
10 loops, best of 3: 58.4 ms per loop
>>> %timeit do_nothing_readlines('lines.txt', 70000) # 7-byte lines, ish
10 loops, best of 3: 50.7 ms per loop
>>> %timeit do_nothing_islice('lines.txt', 100000)
10 loops, best of 3: 76.1 ms per loop
>>> %timeit do_nothing_readlines('lines.txt', 700000) # 7-byte lines, ish
10 loops, best of 3: 70.1 ms per loop

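On a file with an average line length of 7 (the numbers 0 -> 1000000, printed one per line), using readlines with a size hint is a little faster. But only a little. Also note the odd scaling; I don't understand what's going on there.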
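For anyone wanting to repeat the comparison outside IPython, here is a sketch using the standard timeit module. The helper names, the temporary file, and the repeat counts are assumptions, and no timings are shown because they depend on the machine and the Python version.

```python
import itertools
import os
import tempfile
import timeit

def read_islice(path, nlines):
    """Read the whole file in lists of up to nlines lines."""
    with open(path) as f:
        for _chunk in iter(lambda: list(itertools.islice(f, nlines)), []):
            pass

def read_readlines(path, nbytes):
    """Read the whole file in lists of roughly nbytes bytes of lines."""
    with open(path) as f:
        while f.readlines(nbytes):
            pass

def make_test_file(n):
    """Write the integers 0..n-1, one per line, and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.writelines("%d\n" % i for i in range(n))
    return path

if __name__ == "__main__":
    path = make_test_file(1000000)
    try:
        for label, func, arg in [("islice", read_islice, 10000),
                                 ("readlines", read_readlines, 70000)]:
            # Take the best of 3 repeats, each timing 3 full passes.
            best = min(timeit.repeat(lambda: func(path, arg), number=3, repeat=3))
            print("%s: %.1f ms per loop" % (label, best / 3 * 1000))
    finally:
        os.remove(path)
```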
