python - Read only the end of a huge text file

Question

Possible duplicates:
Get the last n lines of a file with Python, similar to tail
Read a file in reverse order using Python

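I have a file that is about 15GB in size. It is a log file whose output I am supposed to analyze. I have already done basic parsing of a similar but much smaller file with just a few lines of logging. Parsing the strings is not the issue. The issue is the huge file and the amount of redundant data it contains.

Basically I am trying to make a Python script that I can tell, for example, "give me the last 5000 lines of the file". That again is basic argument handling and all that, nothing special, I can do that.

But how do I tell the file reader to read ONLY the number of lines I specified from the end of the file? I am trying to skip the huuuuuuge number of lines at the beginning of the file, since I am not interested in those and, to be honest, reading about 15GB of lines from a txt file takes too long. Is there a way to, err.. start reading from the end of the file? Does that even make sense?

It all boils down to the problem that reading a 15GB file line by line takes too long. So I want to skip the data at the beginning that is redundant (to me at least) and only read the number of lines I want from the end of the file.

The obvious answer is to manually copy N lines from the file into another file, but is there a way to do this semi-automatically, just reading N lines from the end of the file with Python?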

score 21 · Accepted Answer

Farm this out to Unix:

import os
os.popen('tail -n 1000 filepath').read()

If you need access to stderr (and some other features), use subprocess.Popen instead of os.popen.
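As a rough sketch of that subprocess variant, assuming a Unix-like system with `tail` on the PATH and Python 3.7 or later: the helper name `tail_lines` is made up for illustration, and `subprocess.run` is used here as a convenience wrapper around `subprocess.Popen`.

```python
import subprocess

def tail_lines(path, n=1000):
    """Return the last n lines of path by delegating to the Unix tail command."""
    # Passing an argument list (no shell) avoids quoting problems with odd file names.
    result = subprocess.run(
        ["tail", "-n", str(n), path],
        capture_output=True,  # collect stdout and stderr separately
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if tail exits with an error
    )
    return result.stdout.splitlines()
```

If `tail` fails (for example because the file does not exist), the raised `CalledProcessError` carries its error message in the `stderr` attribute.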

score 13 · Accepted Answer

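You need to seek to the end of the file, then read some chunks in blocks from the end, counting lines, until you have found enough newlines to read your n lines.

Basically, you are re-implementing a simple form of tail.

Here is some lightly tested code that does just that: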

import os, errno, io

def lastlines(hugefile, n, bsize=2048):
    # get newlines type; text mode uses universal newlines and records what it saw
    with open(hugefile, 'r', errors='replace') as hfile:
        if not hfile.readline():
            return  # empty, no point
        sep = hfile.newlines  # After reading a line, python gives us this
    assert isinstance(sep, str), 'multiple newline types found, aborting'
    sep = sep.encode('ascii')  # the binary reads below return bytes

    # find a suitable seek position in binary mode
    with open(hugefile, 'rb') as hfile:
        hfile.seek(0, os.SEEK_END)
        linecount = 0
        pos = 0

        while linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            try:
                hfile.seek(-bsize, os.SEEK_CUR)           # go backwards
                linecount += hfile.read(bsize).count(sep) # count newlines
                hfile.seek(-bsize, os.SEEK_CUR)           # go back again
            except OSError as e:
                if e.errno == errno.EINVAL:
                    # Attempted to seek past the start, can't go further
                    bsize = hfile.tell()
                    hfile.seek(0, os.SEEK_SET)
                    pos = 0
                    linecount += hfile.read(bsize).count(sep)
                    break
                raise  # Some other I/O exception, re-raise
            pos = hfile.tell()

    # Re-open in binary mode, seek to the byte offset, then decode as text from there
    with open(hugefile, 'rb') as raw:
        raw.seek(pos, os.SEEK_SET)  # our file position from above

        # pos may fall inside a multi-byte character, hence errors='replace';
        # that partial first line is one of the lines skipped below
        for line in io.TextIOWrapper(raw, errors='replace'):
            # We've located n lines *or more*, so skip if needed
            if linecount > n:
                linecount -= 1
                continue
            # The rest we yield
            yield line
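For comparison, the same seek-from-the-end idea can be sketched more compactly by staying in binary mode and only decoding the lines that are kept. This is a minimal sketch, not the answer's code: the name `last_lines`, the 4096-byte block size, and the UTF-8 default are assumptions, and it targets Python 3.

```python
import os

def last_lines(path, n, block_size=4096, encoding="utf-8"):
    """Return the last n lines of path, reading blocks backwards from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Keep prepending blocks until the buffer holds more than n newlines
        # (so the n-th line from the end is complete) or the start is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # Any partial line at the front of the buffer falls outside the last n entries.
    lines = data.splitlines()
    return [line.decode(encoding, errors="replace") for line in lines[-n:]]
```

Re-counting newlines and concatenating the buffer on every pass is wasteful for very large n, but for a few thousand lines out of a 15GB file only a small tail of the file is ever read.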

score -1 · Accepted Answer

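Even though I would prefer the "tail" solution: if you know the maximum number of characters per line, you can implement another possible solution by getting the size of the file, opening a file handle, and using the "seek" method with an estimate of how many characters back from the end you need to go.

The final code should look something like this, just to explain why I also prefer the tail solution :) Good luck!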

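import os

MAX_CHARS_PER_LINE = 80
number_of_requested_lines = 5000  # example value: the question asks for the last 5000 lines
size_of_file = os.path.getsize('15gbfile.txt')
file_handler = open('15gbfile.txt', "rb")
# Estimated offset; clamp at 0 in case the file is shorter than the estimate
seek_index = max(0, size_of_file - (number_of_requested_lines * MAX_CHARS_PER_LINE))
file_handler.seek(seek_index)
buffer = file_handler.read()
file_handler.close()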

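You can improve this code by analyzing the buffer you read for newline characters. Good luck (and you should use the tail solution ;-) I am pretty sure you can get tail for every OS)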
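One way that newline analysis might look, as a hedged sketch: the function name `tail_by_estimate` is hypothetical, `max_chars_per_line` is assumed to be an upper bound on the line length in bytes including the newline, and the result is a list of byte strings. If some lines are longer than the assumed maximum, fewer than n lines come back.

```python
import os

def tail_by_estimate(path, n, max_chars_per_line=80):
    """Guess a seek offset from an assumed maximum line length, then trim to n lines."""
    if n <= 0:
        return []
    size = os.path.getsize(path)
    # One extra line of slack so the piece at the seek point can be dropped.
    start = max(0, size - (n + 1) * max_chars_per_line)
    with open(path, "rb") as f:
        f.seek(start)
        lines = f.read().splitlines()
    if start > 0:
        lines = lines[1:]  # the first piece is probably cut mid-line
    return lines[-n:]
```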

score -2 · Accepted Answer

The preferred method at this point was to just use Unix's tail for the job and modify the Python script to accept its input through stdin.

tail hugefile.txt -n1000 | python magic.py

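It's not very sexy, but at least it takes care of the job. The big file is too much of a burden to handle, I found out. At least for my Python skills. So it was a lot easier to just add a pinch of nix magic to cut down the file size. Tail was new to me. I learned something and figured out another way of using the terminal to my advantage. Thank you everyone.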
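The contents of magic.py are not shown in the thread, so the following is only a guess at its general shape: a script that reads whatever tail pipes into it from stdin. The names `parse` and `main` are hypothetical, and the "parsing" is a placeholder that strips line endings and drops empty lines.

```python
import sys

def parse(stream):
    """Yield each non-empty line from a text stream with its line ending stripped."""
    for line in stream:
        line = line.rstrip("\r\n")
        if line:
            yield line

def main(stream=sys.stdin, out=sys.stdout):
    # tail writes the last lines into the pipe, so this loop only ever sees those.
    for record in parse(stream):
        out.write(record + "\n")
```

Ending the file with a call to `main()` would let it sit at the receiving end of the pipe shown above.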
