python - Is there an easy way to tell which line number a file pointer is on?

Question

In Python 2.5, I am reading a structured text data file (about 30 MB in size) using a file pointer:

fp = open('myfile.txt', 'r')
line = fp.readline()
# ... many other fp.readline() processing steps, which
# are used in different contexts to read the structures

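However, while parsing the file, I come across something interesting and want to report its line number so that I can investigate the file in a text editor. I can use fp.tell() to get the byte offset (e.g. 16548974L), but there is no "fp.tell_line_number()" to help me translate that into a line number.

Is there a Python built-in or extension that makes it easy to track and "tell" which line number a text file pointer is on?

Note: I'm not asking for a line_number += 1 style counter. I call fp.readline() in many different contexts, and inserting the counter in all the right corners of the code would take more debugging than it is worth.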

score 17 · Accepted Answer

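A typical solution to this problem is to define a new class that wraps an existing file instance and counts the lines automatically. Something like this (off the top of my head, I haven't tested it):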

class FileLineWrapper(object):
    def __init__(self, f):
        self.f = f
        self.line = 0
    def close(self):
        return self.f.close()
    def readline(self):
        text = self.f.readline()
        if text:  # an empty string means end of file, so don't count it
            self.line += 1
        return text
    # to allow using in 'with' statements
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

Use it like this:

f = FileLineWrapper(open("myfile.txt", "r"))
f.readline()
print(f.line)
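
If the parsing code also loops over the file with for, the wrapper needs to count those lines too. Here is a minimal Python 3 sketch of that idea. The class name LineCountingFile is hypothetical, and io.StringIO stands in for a real file so the example runs on its own:

```python
import io


class LineCountingFile:
    """Hypothetical wrapper that counts lines for readline() and for loops."""

    def __init__(self, f):
        self.f = f
        self.line = 0  # number of lines returned so far

    def readline(self):
        text = self.f.readline()
        if text:  # an empty result means end of file, so do not count it
            self.line += 1
        return text

    def __iter__(self):
        return self

    def __next__(self):
        # Route iteration through readline() so both paths share one counter.
        text = self.readline()
        if not text:
            raise StopIteration
        return text

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.f.close()


# io.StringIO stands in for a real file opened with open().
with LineCountingFile(io.StringIO("a\nb\nc\n")) as f:
    f.readline()
    after_first = f.line   # one line read so far
    for _ in f:
        pass
    after_loop = f.line    # the loop consumed the remaining lines
    f.readline()           # already at end of file
    after_eof = f.line     # unchanged, since nothing was read
```

Because every read goes through readline(), f.line always holds the number of the line most recently returned, no matter which part of the parser asked for it.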

It looks like the standard module fileinput does much the same thing (and a few other things as well); you could use that instead if you prefer.

score 13 · Accepted Answer

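You might find the fileinput module useful. It provides a generalized interface for iterating over an arbitrary number of files. Some relevant highlights from the docs:

fileinput.lineno()

Return the cumulative line number of the line that has just been read. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line.

fileinput.filelineno()

Return the line number in the current file. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line within the file.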
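
To make the difference between the two counters concrete, here is a small Python 3 sketch. The file names a.txt and b.txt and their contents are made up for the illustration. It calls the methods on the object returned by fileinput.input(), which track the same state as the module-level functions quoted above:

```python
import fileinput
import os
import tempfile

records = []
with tempfile.TemporaryDirectory() as tmpdir:
    # Two throwaway files (hypothetical names and contents).
    paths = []
    for name, body in (("a.txt", "a1\na2\n"), ("b.txt", "b1\nb2\nb3\n")):
        path = os.path.join(tmpdir, name)
        with open(path, "w") as out:
            out.write(body)
        paths.append(path)

    with fileinput.input(files=paths) as stream:
        for line in stream:
            # lineno() keeps counting across files; filelineno() restarts per file.
            records.append((stream.lineno(), stream.filelineno(), line.rstrip("\n")))
```

With these inputs, the cumulative counter runs from 1 to 5 while the per-file counter goes 1, 2 and then 1, 2, 3.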

score 13 · Accepted Answer

The following code prints the line number (where the pointer currently is) while iterating over the file ('testfile'):

file = open("testfile", "r")
for line_no, line in enumerate(file):
    # enumerate() counts from 0, so add 1; the line's content is in 'line'
    print line_no + 1
file.close()

Output:

1
2
3
...

score 1 · Accepted Answer

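I don't think so, at least not in the way you want (as a standard built-in feature of the file handles returned by open).

If you're not amenable to tracking the line number manually as you read lines, or to using a wrapper class (great suggestions from GregH and senderle, by the way), then I think you'll simply have to take the fp.tell() figure, go back to the beginning of the file, and read until you get there.

That's not too bad an option, since I'm assuming error conditions are less likely than everything working smoothly. If everything works, there's no impact.

If there is an error, you pay the added cost of rescanning the file. If the file is big, that may hurt perceived performance, so take that into account if it's a concern.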
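
The rescan described above can be sketched as follows in Python 3. The function name line_number_at and the chunk_size parameter are hypothetical, and the file is assumed to be seekable and opened in binary mode so that tell() returns a plain byte offset:

```python
import io


def line_number_at(fp, offset=None, chunk_size=65536):
    """Return the 1-based number of the line that byte `offset` falls on.

    If `offset` is None, the current position fp.tell() is used.
    The file position is restored before returning.
    """
    saved = fp.tell()
    if offset is None:
        offset = saved
    fp.seek(0)
    newlines = 0
    remaining = offset
    # Count the newline bytes that precede the offset, one chunk at a time.
    while remaining > 0:
        chunk = fp.read(min(chunk_size, remaining))
        if not chunk:
            break
        newlines += chunk.count(b"\n")
        remaining -= len(chunk)
    fp.seek(saved)
    return newlines + 1


# io.BytesIO stands in for a file opened with open(path, "rb").
fp = io.BytesIO(b"one\ntwo\nthree\n")
fp.readline()
fp.readline()
current = line_number_at(fp)       # the pointer sits at the start of line 3
position_after = fp.tell()         # unchanged by the rescan
```

Note that right after a readline() the pointer is at the start of the next line, so the line just read is the returned value minus one.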

score 0 · Accepted Answer

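The code below creates a function Which_Line_for_Position(pos) that gives the line number for the position pos, that is, the number of the line containing the character located at position pos in the file.

The function accepts any position as its argument, independently of the current value of the file pointer and of how that pointer moved before the call.

So with this function you are not limited to determining the current line number only during an uninterrupted iteration over the lines, as is the case with Greg Hewgill's solution.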

with open(filepath,'rb') as f:
    GIVE_NO_FOR_END = {}
    end = 0
    for i,line in enumerate(f):
        end += len(line)
        GIVE_NO_FOR_END[end] = i
    if line[-1]=='\n':
        GIVE_NO_FOR_END[end+1] = i+1
    end_positions = GIVE_NO_FOR_END.keys()
    end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

.

The same solution can be written with the help of the fileinput module:

import fileinput

GIVE_NO_FOR_END = {}
end = 0
for line in fileinput.input(filepath,'rb'):
    end += len(line)
    GIVE_NO_FOR_END[end] = fileinput.filelineno()
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = fileinput.filelineno()+1
fileinput.close()

end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

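But this solution has some drawbacks:

It requires importing the fileinput module.
It deletes all the content of the file! There must be something wrong in my code, but I don't know fileinput well enough to find it. Or is this the normal behavior of the fileinput.input() function?
It seems that the file is read in full before any iteration starts. If so, the size of a very big file could exceed the capacity of RAM. I'm not sure about this point: I tried to test with a 1.5 GB file, but it took rather long and I've dropped the matter for now. If this point is correct, it is an argument for using the other solution with enumerate().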
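
A likely explanation for the deleted contents is that the second positional parameter of fileinput.input() is inplace, not mode. The string 'rb' is then treated as a true inplace flag, which rewrites the file from standard output. Passing the mode by keyword avoids that. Here is a minimal Python 3 sketch, using a throwaway temporary file with made-up contents:

```python
import fileinput
import os
import tempfile

original = b"first\nsecond\nthird"

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(original)
    filepath = tmp.name

line_for_end = {}
end = 0
# mode is given by keyword; a bare second positional argument would be
# interpreted as `inplace` instead.
with fileinput.input(filepath, mode="rb") as stream:
    for line in stream:
        end += len(line)
        line_for_end[end] = stream.filelineno()

# Confirm that reading left the file untouched, then clean up.
with open(filepath, "rb") as check:
    unchanged = check.read() == original
os.remove(filepath)
```

Here filelineno() is 1-based, so the dictionary maps each line's end offset to a line number starting at 1 rather than 0.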

.

Example:

text = '''Harold Acton (1904–1994)
Gilbert Adair (born 1944)
Helen Adam (1909–1993)
Arthur Henry Adams (1872–1936)
Robert Adamson (1852–1902)
Fleur Adcock (born 1934)
Joseph Addison (1672–1719)
Mark Akenside (1721–1770)
James Alexander Allan (1889–1956)
Leslie Holdsworthy Allen (1879–1964)
William Allingham (1824/28-1889)
Kingsley Amis (1922–1995)
Ethel Anderson (1883–1958)
Bruce Andrews (born 1948)
Maya Angelou (born 1928)
Rae Armantrout (born 1947)
Simon Armitage (born 1963)
Matthew Arnold (1822–1888)
John Ashbery (born 1927)
Thomas Ashe (1836–1889)
Thea Astley (1925–2004)
Edwin Atherstone (1788–1872)'''


#with open('alao.txt','rb') as f:

f = text.splitlines(True)
# argument True in splitlines() makes the newlines kept

GIVE_NO_FOR_END = {}
end = 0
for i,line in enumerate(f):
    end += len(line)
    GIVE_NO_FOR_END[end] = i
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = i+1
end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()


print '\n'.join('line %-3s  ending at position %s' % (str(GIVE_NO_FOR_END[end]),str(end))
                for end in end_positions)

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

print
for x in (2,450,320,104,105,599,600):
    print 'pos=%-6s   line %s' % (x,Which_Line_for_Position(x))

Result

line 0    ending at position 25
line 1    ending at position 51
line 2    ending at position 74
line 3    ending at position 105
line 4    ending at position 132
line 5    ending at position 157
line 6    ending at position 184
line 7    ending at position 210
line 8    ending at position 244
line 9    ending at position 281
line 10   ending at position 314
line 11   ending at position 340
line 12   ending at position 367
line 13   ending at position 393
line 14   ending at position 418
line 15   ending at position 445
line 16   ending at position 472
line 17   ending at position 499
line 18   ending at position 524
line 19   ending at position 548
line 20   ending at position 572
line 21   ending at position 600

pos=2        line 0
pos=450      line 16
pos=320      line 11
pos=104      line 3
pos=105      line 4
pos=599      line 21
pos=600      line None

.

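Then, with the function Which_Line_for_Position() available, it is easy to obtain the number of the current line: just pass f.tell() as the argument to the function.

But a warning: when using f.tell() and moving the file pointer around in the file, it is absolutely necessary to open the file in binary mode: 'rb' or 'rb+' or 'ab' or ....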
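
For readers on Python 3, the same idea can be sketched with a sorted list and the bisect module instead of a dictionary scan. This is an alternative formulation, not the code above: the names build_line_ends and line_for_position are hypothetical, and unlike the original it returns None for a position at or past the end of the file even when the file ends with a newline:

```python
import bisect
import io


def build_line_ends(binary_file):
    """Return the sorted list of byte offsets at which each line ends."""
    ends = []
    offset = 0
    for line in binary_file:
        offset += len(line)
        ends.append(offset)
    return ends


def line_for_position(ends, pos):
    """Return the 0-based line containing byte offset pos, or None if out of range."""
    if not ends or pos < 0 or pos >= ends[-1]:
        return None
    # The number of line ends at or before pos equals the index of the line holding pos.
    return bisect.bisect_right(ends, pos)


# io.BytesIO stands in for a file opened in binary mode.
sample = io.BytesIO(b"alpha\nbeta\ngamma\n")
ends = build_line_ends(sample)
```

The lookup is a binary search, so it stays fast even when the file has millions of lines.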

score 0 · Accepted Answer

One way might be to iterate over the lines and explicitly count the number of lines seen so far:

>>> from itertools import izip
>>> from itertools import count
>>> f=open('test.java','r')
>>> for line_no,line in izip(count(),f):
...     print line_no,line

score 0 · Accepted Answer

I recently solved a similar problem and came up with this class-based solution.

class TextFileProcessor(object):

    def __init__(self, path_to_file):
        self.print_line_mod_number = 0
        self.__path_to_file = path_to_file
        self.__line_number = 0

    def __printLineNumberMod(self):
        if self.print_line_mod_number != 0:
            if self.__line_number % self.print_line_mod_number == 0:
                print(self.__line_number)

    def processFile(self):
        with open(self.__path_to_file, 'r', encoding='utf-8') as text_file:
            for self.__line_number, line in enumerate(text_file, start=1):
                self.__printLineNumberMod()

                # do some stuff with line here.

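Set the print_line_mod_number attribute to the interval at which you want line numbers logged, then call processFile.

For example, if you want feedback every 100 lines, it would look like this.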

tfp = TextFileProcessor('C:\\myfile.txt')
tfp.print_line_mod_number = 100
tfp.processFile()

The console output would be

100
200
300
400
etc...

score 0 · Accepted Answer

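Open the file with a context manager using with, and enumerate the lines in a for loop.

with open('file_name.ext', 'r') as f:
    # (line_num, line) pairs; enumerate() numbers the lines from 0
    numbered_lines = [(line_num, line) for line_num, line in enumerate(f)]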

score -1 · Accepted Answer

Regarding @eyquem's solution, I suggest passing mode='r' to the fileinput module and using fileinput.lineno(); that worked for me.

Here is how I implemented these options in my code.

import fileinput

table = fileinput.input('largefile.txt', mode="r")
# The condition is specific to my code; it is shown only to illustrate the approach.
if fileinput.lineno() >= stop:
    temp_out.close()
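
The fragment above relies on names from the author's own program (stop, temp_out). As a self-contained illustration of the same approach in Python 3, the sketch below stops reading once the line counter reaches a limit. The function name read_until and the sample file contents are hypothetical:

```python
import fileinput
import os
import tempfile


def read_until(path, stop):
    """Collect lines from path until the cumulative line number reaches stop."""
    collected = []
    with fileinput.input(path, mode="r") as stream:
        for line in stream:
            collected.append(line)
            if stream.lineno() >= stop:
                break
    return collected


# Throwaway sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\nd\n")
    sample_path = tmp.name

first_two = read_until(sample_path, 2)
os.remove(sample_path)
```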
