python - Extracting a subset of lines from a file

Question

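I have some files that start with a varying number of header lines in random order, followed by the data I need, which spans the number of lines given by the corresponding header, e.g. Lines: 3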

from: blah@blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff

Foo Bar Lines of Data, which take up
some arbitrary long amount  characters on a single line, but no  matter how long 
they still only take up the number of lines as specified in the header

How can I get that data in a single read of the file? PS: the data comes from the 20Newsgroups corpus.

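Edit: I guess a quick solution, which only works if I relax the read-only-once constraint, would be:

[First read] find total_num_of_lines and match the first Lines: header,
[Second read] discard the first (total_num_of_lines - header_num_of_lines) lines, then read the rest of the file.

Still, I don't know of a way to read the data in a single pass.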
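The two-pass idea from the edit above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for the corpus: the function name `body_two_pass` is made up for illustration, it assumes a seekable text file, and it assumes the body really is the last N lines of the file (a trailing blank line that the Lines: header does not count would shift the result).

```python
import io
from itertools import islice

def body_two_pass(fp):
    """Return the last N lines of a seekable text file, N taken from its Lines: header."""
    total = 0
    declared = None
    # First pass: count every line and remember the first Lines: header.
    for line in fp:
        total += 1
        if declared is None and line.lower().startswith("lines:"):
            declared = int(line.split(":", 1)[1])
    if declared is None:
        raise ValueError("no Lines: header found")
    # Second pass: rewind, skip the leading lines, keep the rest.
    fp.seek(0)
    return list(islice(fp, max(total - declared, 0), None))

# Hypothetical sample input, standing in for a real corpus file.
sample = io.StringIO("from: blah@blah.com\nLines: 2\n\nfirst\nsecond\n")
print(body_two_pass(sample))
```

With a real file, the same function would be called on the object returned by `open(path)`.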

score 3 · Accepted Answer

I'm not quite sure you even need the beginning of the file to get its contents. Consider using split:

# Split once, at the first blank line. Text-mode reads already use "\n",
# so os.linesep would not match on Windows.
_, contents = file_contents.split("\n\n", 1)

However, if the Lines header really matters, you can use the technique suggested above along with parsing the file headers:

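# Split once, so blank lines inside the body do not break the unpacking
headers, contents = file_contents.split("\n\n", 1)

# Get the declared line count from the "Lines:" header
headers_list = [line.split() for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])

# Get contents: keep that many lines (slicing the string would count characters)
real_contents = contents.splitlines()[:lines_count]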

score 2 · Accepted Answer

Assuming the general case, where several messages may follow one another, maybe something like

from itertools import islice, takewhile

def msgreader(file):
    while True:
        # Header lines run up to (and consume) the first blank line
        header = list(takewhile(lambda x: x.strip(), file))
        if not header: break
        header_dict = {k: v.strip() for k, v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        # islice stops quietly if the file is shorter than declared
        message = list(islice(file, line_count))
        yield message

would work, where

with open("53903") as fp:
    for message in msgreader(fp):
        print(message)

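will give all the listed messages. For this particular use case the above is probably overkill, but frankly, extracting all the header information isn't any harder than extracting just one line. I'd be surprised, though, if there weren't already a module for parsing these messages.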
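One such module is the standard library's `email` package, which parses this kind of "headers, blank line, body" layout. The following is only a sketch: the helper name `body_lines` is made up for illustration, and it assumes a plain, non-multipart message whose Lines header holds an integer.

```python
import email

def body_lines(raw_message):
    """Return the first N body lines of a message, N taken from its Lines header."""
    msg = email.message_from_string(raw_message)
    declared = int(msg["Lines"])   # header lookup is case-insensitive
    body = msg.get_payload()       # a str for non-multipart messages
    return body.splitlines()[:declared]

# Hypothetical sample input shaped like the example in the question.
raw = (
    "from: blah@blah.com\n"
    "Subject: foobarhah\n"
    "Lines: 3\n"
    "Extra: More random stuff\n"
    "\n"
    "Foo\nBar\nBaz\n"
)
print(body_lines(raw))
```

For a file on disk, `email.message_from_file(fp)` takes an open text file in place of the string.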

score 1 · Accepted Answer

You need to store the state of whether the headers are finished. That's all.
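A single-pass reader built around that one piece of state might look like the sketch below. The name `read_body` and the error handling are illustrative choices, not part of the original answer, and the sketch assumes the headers end at the first blank line.

```python
def read_body(lines):
    """Single pass: track whether we are still in the headers, then take N body lines."""
    in_header = True
    remaining = None   # set once the Lines: header has been seen
    body = []
    for line in lines:
        if in_header:
            if not line.strip():
                # The first blank line ends the header block.
                if remaining is None:
                    raise ValueError("no Lines: header before the body")
                in_header = False
            elif line.lower().startswith("lines:"):
                remaining = int(line.split(":", 1)[1])
        elif remaining > 0:
            body.append(line)
            remaining -= 1
        else:
            break   # declared number of lines already collected
    return body
```

Because it only iterates, it accepts an open file object as well as any list of lines, and it never reads past the declared body.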

answered 2012-08-21T20:53:07.317
