
I am reading a file with Python, and the file contains sections that are marked off by lines starting with the "#" character:

#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233 
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION

I wrote the following code to read the file:

with open(filename) as fh:

    enumerated = enumerate(iter(fh.readline, ''), start=1)

    for lino, line in enumerated:

        # handle special section
        if line.startswith('#'):

            print("="*40)
            print(line)

            while True:

                start = fh.tell()
                lino, line = next(enumerated)

                if line.startswith('#'):
                    fh.seek(start)
                    break

                print("[{}] {}".format(lino,line))

The output is:

========================================
#HEADER1, SOME EXTRA INFO

[2] data first section

[3] 1 2

[4] 1 233 

[5] ...

[6] // THIS IS A COMMENT

========================================
#HEADER2, SECOND SECTION

[9] 452

[10] 134

[11] // ANOTHER COMMENT

[12] ...

========================================
#HEADER3, THIRD SECTION

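As you can see, the line counter lino is no longer valid, because I am using seek. Decrementing it before breaking out of the loop does not help either, because the counter is incremented on every call to next. Is there an elegant way to solve this in Python 3.x? Also, is there a better way to handle StopIteration than putting a pass statement in an except block?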

Update

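So far I have adopted an implementation based on @Dunes' suggestion. I had to change it slightly so that I can peek ahead to see whether a new section is starting. I don't know whether there is a better way to do this, so please jump in with comments: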

class EnumeratedFile:

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        result = self.lineno, self.fh.readline()
        if result[1] == '':
            raise StopIteration

        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)

    def section(self):
        pos = self.fh.tell()
        char = self.fh.read(1)
        self.fh.seek(pos)
        return char != '#'

The file is then read and each section is processed as follows:

# create enumerated object
e = EnumeratedFile(fh)

header = ""
for lineno, line, in e:

    print("[{}] {}".format(lineno, line))

    header = line.rstrip()

    # HEADER1
    if header.startswith("#HEADER1"):

        # process header 1 lines
        while e.section():

            # get node line
            lineno, line = next(e)
            # do whatever needs to be done with the line

    elif header.startswith("#HEADER2"):

        # etc.
        pass

2 Answers


No, you cannot alter the counter of an enumerate() iterable.

You don't need to here at all, and you don't need to seek either. Instead, use a nested loop and buffer the section header:

with open(filename) as fh:
    enumerated = enumerate(fh, start=1)
    header = None
    for lineno, line in enumerated:
        # seek to first section
        if header is None:
            if not line.startswith('#'):
                continue
            header = line

        print("=" * 40)
        print(header.rstrip())
        for lineno, line in enumerated:
            if line.startswith('#'):
                # new section
                header = line
                break

            # section line, handle as such
            print("[{}] {}".format(lineno, line.rstrip()))

This buffers only the header line; each time we encounter a new header, it is stored and the current section loop ends.

Demo:

>>> from io import StringIO
>>> demo = StringIO('''\
... #HEADER1, SOME EXTRA INFO
... data first section
... 1 2
... 1 233 
... ...
... // THIS IS A COMMENT
... #HEADER2, SECOND SECTION
... 452
... 134
... // ANOTHER COMMENT
... ...
... #HEADER3, THIRD SECTION
... ''')
>>> enumerated = enumerate(demo, start=1)
>>> header = None
>>> for lineno, line in enumerated:
...     # seek to first section
...     if header is None:
...         if not line.startswith('#'):
...             continue
...         header = line
...     print("=" * 40)
...     print(header.rstrip())
...     for lineno, line in enumerated:
...         if line.startswith('#'):
...             # new section
...             header = line
...             break
...         # section line, handle as such
...         print("[{}] {}".format(lineno, line.rstrip()))
... 
========================================
#HEADER1, SOME EXTRA INFO
[2] data first section
[3] 1 2
[4] 1 233
[5] ...
[6] // THIS IS A COMMENT
========================================
#HEADER2, SECOND SECTION
[9] 134
[10] // ANOTHER COMMENT
[11] ...
>>> header
'#HEADER3, THIRD SECTION\n'

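The third section remains unprocessed because it contains no lines, but if it did, the header variable would already be set in anticipation.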
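
One thing to watch in the demo output: line 8 (452) never appears. When the inner loop breaks on a header, the outer for statement fetches the next item from the same enumerate object before its body runs again, so the first line after each later header is consumed silently. A minimal single-loop sketch that avoids this follows; it is a different technique from the nested loop above, and iter_sections is a hypothetical helper name, not part of any library:

```python
from io import StringIO


def iter_sections(lines, start=1):
    """Yield (header_lineno, header, body) for each '#' section.

    body is a list of (lineno, line) pairs. Lines before the first
    header are ignored. Hypothetical helper, for illustration only.
    """
    header = None
    header_lineno = None
    body = []
    for lineno, line in enumerate(lines, start=start):
        if line.startswith('#'):
            # A new header closes the previous section, if there was one.
            if header is not None:
                yield header_lineno, header, body
            header_lineno, header, body = lineno, line.rstrip(), []
        elif header is not None:
            body.append((lineno, line.rstrip()))
    # Emit the last section, even when it has no body lines.
    if header is not None:
        yield header_lineno, header, body


demo = StringIO(
    "#HEADER1, SOME EXTRA INFO\n"
    "data first section\n"
    "1 2\n"
    "1 233 \n"
    "...\n"
    "// THIS IS A COMMENT\n"
    "#HEADER2, SECOND SECTION\n"
    "452\n"
    "134\n"
    "// ANOTHER COMMENT\n"
    "...\n"
    "#HEADER3, THIRD SECTION\n"
)

sections = list(iter_sections(demo))
for header_lineno, header, body in sections:
    print("=" * 40)
    print(header)
    for lineno, line in body:
        print("[{}] {}".format(lineno, line))
```

Because a single enumerate drives everything, the line numbers stay in step with the file and no seek is needed. The empty third section is reported as well.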

answered 2014-12-09T15:57:43.620

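You could copy the iterator and later restore it from that copy. However, you cannot copy file objects. You could take a shallow copy of the enumerator and then seek to the corresponding position in the file when you start using the copied enumerator.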

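However, the best thing to do is to write your own generator class, with a __next__ method that produces line numbers and lines, and mark and recall methods that record a state and return to the previously recorded state.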

class EnumeratedFile:

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        result = self.lineno, next(self.fh)
        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)

You would use it like this:

from io import StringIO
demo = StringIO('''\
#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233 
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION
''')

e = EnumeratedFile(demo)
seen_header2 = False
for lineno, line, in e:
    if seen_header2:
        print(lineno, line)
        assert (lineno, line) == (2, "data first section\n")
        break
    elif line.startswith("#HEADER1"):
        e.mark()
    elif line.startswith("#HEADER2"):
        e.recall()
        seen_header2 = True
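
As a sketch of how mark and recall could serve the look-ahead from the question: mark before each read, and recall when the line just read turns out to be a header. Two things here are assumptions of this sketch rather than part of the class above. First, __next__ uses readline() instead of next(self.fh), because on a real text file tell() raises OSError once next() has been called on it. Second, read_sections is a hypothetical helper. Passing a default to next() also removes the need for a try/except around StopIteration.

```python
from io import StringIO


class EnumeratedFile:
    """Yields (lineno, line) pairs and supports mark()/recall()."""

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        # readline() keeps tell() usable on real text files (assumption
        # of this sketch; the class above uses next(self.fh)).
        line = self.fh.readline()
        if line == '':
            raise StopIteration
        result = self.lineno, line
        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)


def read_sections(e):
    """Collect (header_lineno, header, body) tuples (hypothetical helper)."""
    sections = []
    entry = next(e, None)  # the default replaces try/except StopIteration
    while entry is not None:
        lineno, line = entry
        if line.startswith('#'):
            body = []
            sections.append((lineno, line.rstrip(), body))
            while True:
                e.mark()  # remember the state before peeking
                entry = next(e, None)
                if entry is None:
                    break
                if entry[1].startswith('#'):
                    e.recall()  # un-read the header; lineno is restored too
                    break
                body.append((entry[0], entry[1].rstrip()))
        entry = next(e, None)
    return sections


demo = StringIO(
    "#HEADER1, SOME EXTRA INFO\n"
    "data first section\n"
    "1 2\n"
    "#HEADER2, SECOND SECTION\n"
    "452\n"
    "134\n"
    "#HEADER3, THIRD SECTION\n"
)

sections = read_sections(EnumeratedFile(demo))
for header_lineno, header, body in sections:
    print(header_lineno, header, body)
```

Since recall restores the line number together with the file position, the header is read again with its correct number.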
answered 2014-12-09T17:20:28.210