python - How do I jump to a particular line in a huge text file?

Question

Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)

    linesCounter += 1

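I'm processing a huge text file (~15MB) whose lines are of unknown, varying length, and I need to jump to a particular line whose number I know in advance. I feel bad processing the lines one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.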

score 127 · Accepted Answer

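You can't jump ahead without reading the file at least once, because you don't know where the line breaks are. You could do something like this: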

# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])

score 34 · Accepted Answer

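linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...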
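A minimal sketch of how linecache might be applied to the question. The helper name get_line is made up for illustration. Note that linecache.getline counts lines from 1, returns an empty string when the line does not exist, and reads the whole file into memory the first time it is asked about it.

```python
import linecache


def get_line(path, lineno):
    """Return line `lineno` (1-based) of `path`, or '' if there is no such line."""
    # Drop any stale cache entry in case the file changed on disk.
    linecache.checkcache(path)
    return linecache.getline(path, lineno)
```

Repeated calls for the same file are served from the cache, so this suits many lookups in one file better than a single lookup.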

score 22 · Accepted Answer

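You don't really have that many options if the lines are of different lengths... sadly, you need to process the line-ending characters to know when you have moved on to the next line.

You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something other than 0.

0 means the file read operations are unbuffered, which is very slow and disk-intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still iterate over it with for line in open(etc):.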
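A small sketch of the buffered variant, assuming the file is opened in binary mode as in the question. The function name lines_from and the 8192-byte default are illustrative choices, not part of the original answer.

```python
def lines_from(path, start_line, buffer_size=8192):
    """Yield the lines of `path` as bytes, starting at 0-based index `start_line`."""
    # A buffer size above 1 makes Python read the file in chunks of that size.
    with open(path, "rb", buffer_size) as f:
        for index, line in enumerate(f):
            if index >= start_line:
                yield line
```

The earlier lines are still read and scanned for line endings; buffering only makes that scan cheaper.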

score 12 · Accepted Answer

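I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading it into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.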

score 10 · Accepted Answer

I'm surprised no one mentioned islice

line = next(itertools.islice(Fhandle,index_of_interest,index_of_interest+1),None) # just the one line

or if you want the whole rest of the file

rest_of_file = itertools.islice(Fhandle,index_of_interest,None)
for line in rest_of_file:
    print(line)

or if you want every other line from the file

rest_of_file = itertools.islice(Fhandle,index_of_interest,None,2)
for odd_line in rest_of_file:
    print(odd_line)

score 5 · Accepted Answer

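Since there is no way to determine the length of all the lines without reading them, you have no choice but to iterate over all the lines before your starting line. All you can do is make it look nice. If the file is really huge, you might want to use a generator-based approach: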

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r"), 141978):
    DoSomethingWithThisLine(line)

Note: the index is zero-based in this approach.

score 4 · Accepted Answer

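If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course, it all depends on what you're trying to do and how often you will jump around the file.

For instance, if you're going to jump to lines many times in the same file, and you know the file does not change while you work with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (for example, every 1000 lines).
Then, if you want line 12005, jump to the position of line 12000 (which you've recorded), read 5 lines, and you'll know you're at line 12005, and so on.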
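The checkpoint idea described above can be sketched roughly as follows. The class name SparseLineIndex and its methods are hypothetical, lines are numbered from 0, and the file is assumed not to change after the index is built.

```python
class SparseLineIndex:
    """Remember the byte offset of every `step`-th line of a file."""

    def __init__(self, path, step=1000):
        self.path = path
        self.step = step
        self.checkpoints = []  # checkpoints[k] is the offset of line k * step
        offset = 0
        with open(path, "rb") as f:
            for index, line in enumerate(f):
                if index % step == 0:
                    self.checkpoints.append(offset)
                offset += len(line)

    def get_line(self, lineno):
        """Return 0-based line `lineno` as bytes, or None if it does not exist."""
        if lineno < 0:
            return None
        slot, remainder = divmod(lineno, self.step)
        if slot >= len(self.checkpoints):
            return None
        with open(self.path, "rb") as f:
            f.seek(self.checkpoints[slot])
            for _ in range(remainder):
                f.readline()  # skip the lines between the checkpoint and the target
            return f.readline() or None
```

A larger step keeps the index smaller at the cost of reading more lines per lookup.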

score 4 · Accepted Answer

You can use mmap to find the offset of the lines. mmap seems to be the fastest way to process a file.

Example:

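import mmap

with open('input_file', "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    offsets = 0  # byte offset where line i starts
    # b"" is the sentinel: readline() on an mmap returns bytes
    for line in iter(mapped.readline, b""):
        if i == Line_I_want_to_jump:
            break
        offsets = mapped.tell()  # the next line starts here
        i+=1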

Then use f.seek(offsets) to move to the line you need.

score 4 · Accepted Answer

None of the answers is particularly satisfactory, so here's a small snippet to help.

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.  
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Example usage:

In: !cat /tmp/test.txt

Out:
Line zero.
Line one!

Line three.
End of file, line four.

In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)    
    print(seeker[1])
Out:
Line one!

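This involves a lot of file seeks, but it is useful when you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.

I offer the snippet above under the MIT or Apache license, at the discretion of the user.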

score 3 · Accepted Answer

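If you know the position in the file in advance (rather than the line number), you can use file.seek() to go to that position.

Edit: you can use the linecache.getline(filename, lineno) function, which returns the contents of line lineno, but only after reading the entire file into memory. That is good if you're randomly accessing lines from within the file (as Python itself might want to do to print a traceback), but not good for a 15MB file.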

score 3 · Accepted Answer

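What generates the file you want to process? If it is under your control, you could generate an index (which line is at which position) whenever the file is appended to. The index file can have a fixed line size (space-padded or 0-padded numbers) and will definitely be smaller, so it can be read and processed quickly.

Which line do you want?
Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
Use seek or whatever to jump directly to that entry in the index file and read it.
Parse it to get the byte offset of the corresponding line in the actual file.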
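The steps above can be sketched as follows. This is only an illustration under assumed conventions: each index entry is a 20-digit zero-padded byte offset followed by a newline, lines are numbered from 0, and the index is built in one pass over an existing file rather than while appending. The names ENTRY_WIDTH, build_index and read_line are made up for the example.

```python
ENTRY_WIDTH = 21  # 20 zero-padded digits plus a newline (assumed layout)


def build_index(data_path, index_path):
    """Write one fixed-width offset entry per line of `data_path`."""
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            index.write(b"%020d\n" % offset)
            offset += len(line)


def read_line(data_path, index_path, lineno):
    """Return 0-based line `lineno` as bytes, or None if it does not exist."""
    if lineno < 0:
        return None
    with open(index_path, "rb") as index:
        # Entries have constant size, so the entry position is a multiplication.
        index.seek(lineno * ENTRY_WIDTH)
        entry = index.read(ENTRY_WIDTH)
    if len(entry) < ENTRY_WIDTH:
        return None
    with open(data_path, "rb") as data:
        data.seek(int(entry.decode("ascii")))
        return data.readline()
```

Each lookup costs two seeks and two short reads, regardless of how far into the file the line is.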

score 3 · Accepted Answer

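I had the same problem (I needed to retrieve a specific line from a huge file).

Of course, I could run through all the records in the file every time and stop when the counter equals the target line, but that is inefficient when you want to fetch several specific lines. So the main issue to solve was how to go directly to the required position in the file.

I came up with the following solution: first I filled a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).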

t = open(file, 'rb')  # binary mode, so len() counts bytes and matches seek()
dict_pos = {}

kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length+len(each)
    kolvo = kolvo+1

Finally, the target function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

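t.seek(dict_pos.get(line_number)) moves the file position to the start of the requested line, so the next readline() call returns the target line.

Using this approach I saved a significant amount of time.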

score 2 · Accepted Answer

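Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, and so on.

Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().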
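A rough sketch of the binary search idea, assuming every line starts with an integer index followed by a colon and that the indices increase through the file. The function names are hypothetical, and a file containing lines without such a prefix would make the int() call fail.

```python
import os


def _seek_to_line_start(f, pos):
    """Position `f` at the first line start located at or after byte `pos`."""
    if pos == 0:
        f.seek(0)
        return
    # Start one byte early: if that byte is a newline, `pos` is itself a line start.
    f.seek(pos - 1)
    f.readline()


def find_indexed_line(path, target):
    """Return the line whose integer prefix equals `target`, or None."""
    with open(path, "rb") as f:
        low, high = 0, f.seek(0, os.SEEK_END)
        while low < high:
            mid = (low + high) // 2
            _seek_to_line_start(f, mid)
            line = f.readline()
            if line and int(line.split(b":", 1)[0]) < target:
                low = mid + 1  # the target starts after this probe position
            else:
                high = mid  # the target starts at or before the probed line
        _seek_to_line_start(f, low)
        line = f.readline()
    if line and int(line.split(b":", 1)[0]) == target:
        return line
    return None
```

The search is over byte offsets rather than line numbers, so it needs only a logarithmic number of seeks and never builds an index.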

score 2 · Accepted Answer

If you're dealing with a text file on a Linux-based system, you can use the Linux commands. For me, this worked well!

import subprocess

def read_line(path, line=1):
    # head prints the first `line` lines; tail keeps only the last of them
    return subprocess.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)

score 1 · Accepted Answer

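Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.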

def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno-lines_read-1]
        lines_read += len(lines)

print(getlineno("nci_09425001_09450000.smi", 12000))

score 0 · Accepted Answer

@george brilliantly suggested mmap, which presumably uses the mmap system call. Here's another rendition.

import mmap

LINE = 2  # your desired line

with open('data.txt','rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
  for i,line in enumerate(iter(data.readline, b'')):
    if i!=LINE: continue
    pos = data.tell() - len(line)
    break

  # optionally copy data to `chunk`
  i_file.seek(pos)
  chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')

score -1 · Accepted Answer

You can use this function to return line n:

def skipton(infile, n):
    with open(infile,'r') as fi:
        for i in range(n-1):
            next(fi)
        return next(fi)
