
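I have a file whose segments are separated by a blank line ("\n"); the number of lines in each segment is unknown. A sample of the file looks like this: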

800004
The London and North-Western's Euston Station was first, but at the eastern end of Euston Road the Great Northern constructed their King's Cross terminal. 
Initially the Midland Railway ran into King's Cross but a quarrel over access led them to construct next door to King's Cross their St Pancras terminal, which was topped by a statue of Britannia, a <tag "510285">calculated</> snook-cocking exercise because Britannia was the company emblem of the Midland's hated rival, the London and North-Western. 

800005
GROWTH in Malaysia's gross domestic product this year is expected to be 8.5 per cent.
Nearly two percentage points higher than the Treasury's estimate, Bank Negara, the central bank, reported yesterday. 
Last year's growth, <tag "510270">calculated</> by the bank, was 8.7 per cent, compared with 7.6 per cent by the Treasury.   

800006
He was a Catholic. 
When he visited the Pope, even then, he couldn't help <tag "510270">calculating</> the Pope's worldly riches (life-proprietor of the Sistine Chapel, landlord of the Vatican and contents &ellip. ). 

Is there a simpler way to get the segments from the text file?

This is how I have been doing it:

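doc = []
segments = []
with open(trainfile) as f:
    for line in f:
        if line == "\n":
            # a blank line ends the current segment
            doc.append(segments)
            segments = []
        else:
            segments.append(line.strip())
# keep the last segment when the file does not end with a blank line
if segments:
    doc.append(segments)

for i in doc:
    print(i)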

3 Answers


Use a generator function:

def per_section(it):
    section = []
    for line in it:
        if line.strip():
            section.append(line)
        else:
            # a blank line closes the current section
            yield ''.join(section)
            section = []
    # yield any remaining lines as a section too
    if section:
        yield ''.join(section)

This yields each section, delimited by blank lines, as a single string:

with open(sectionedfile, 'r') as inputfile:
    for section in per_section(inputfile):
        print(section)
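
To try the generator without a file on disk, here is a minimal sketch that feeds it an in-memory io.StringIO. The sample text and the name `sample` are made up for illustration. The `elif section:` guard is an assumption added here and is not part of the answer's code: it keeps consecutive blank lines from producing empty sections.

```python
import io

def per_section(it):
    """Yield blank-line-separated sections of `it` as single strings."""
    section = []
    for line in it:
        if line.strip():
            section.append(line)
        elif section:
            # blank line: emit the section collected so far, if any
            yield ''.join(section)
            section = []
    if section:
        # the input may not end with a blank line
        yield ''.join(section)

# hypothetical sample input; note the two consecutive blank lines
sample = io.StringIO("800004\nfirst line\n\n\n800005\nsecond line\n")
sections = list(per_section(sample))
print(sections)
```

Because the function only iterates over its argument, the same code works with an open file object, a list of lines, or any other iterable of strings.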
answered 2013-06-05T13:22:52.933

It looks like itertools.groupby is your friend:

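from itertools import groupby

for k, section in groupby(file, key=str.isspace):
    if not k:  # k is True for the blank separator lines, so skip those groups
        for line in section:
            ...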
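
As a fuller illustration of the groupby idea, here is a small sketch that collects every section into a list of stripped lines. The helper name `sections_from` and the in-memory `sample` are hypothetical, chosen only for this example.

```python
import io
from itertools import groupby

def sections_from(lines):
    """Return a list of sections, each a list of stripped lines."""
    return [
        [line.strip() for line in group]
        for is_blank, group in groupby(lines, key=str.isspace)
        if not is_blank  # drop the groups made of blank separator lines
    ]

# hypothetical sample input standing in for the real file
sample = io.StringIO("800004\nfirst line\n\n800005\nsecond line\n")
result = sections_from(sample)
print(result)
```

groupby merges runs of consecutive blank lines into a single group, so several empty lines in a row still count as one separator.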
answered 2013-06-05T13:28:24.483

If the file is not very large, you can also read it in one go and use str.split to split on '\n\n'.

If the file is large, use the approach suggested by @Martijn Pieters.

with open('abc') as f:
    data = f.read()
    segments = data.split('\n\n')

for x in segments:
    print('---> ' + x)

Output:

---> 800004
The London and North-Western's Euston Station was first, but at the eastern end of Euston Road the Great Northern constructed their King's Cross terminal. 
Initially the Midland Railway ran into King's Cross but a quarrel over access led them to construct next door to King's Cross their St Pancras terminal, which was topped by a statue of Britannia, a <tag "510285">calculated</> snook-cocking exercise because Britannia was the company emblem of the Midland's hated rival, the London and North-Western. 
---> 800005
GROWTH in Malaysia's gross domestic product this year is expected to be 8.5 per cent.
Nearly two percentage points higher than the Treasury's estimate, Bank Negara, the central bank, reported yesterday. 
Last year's growth, <tag "510270">calculated</> by the bank, was 8.7 per cent, compared with 7.6 per cent by the Treasury.   
---> 800006
He was a Catholic. 
When he visited the Pope, even then, he couldn't help <tag "510270">calculating</> the Pope's worldly riches (life-proprietor of the Sistine Chapel, landlord of the Vatican and contents &ellip. ). 
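
Note that data.split('\n\n') relies on exactly one truly empty line between segments. If the separators might contain stray spaces or come in runs, a regular expression is one possible alternative. The following is only a sketch under that assumption; `split_segments` and `text` are illustrative names, not part of the answer above.

```python
import re

def split_segments(text):
    """Split text on one or more blank (or whitespace-only) lines."""
    # strip() drops leading/trailing blank lines so no empty segment appears
    return [seg for seg in re.split(r'\n\s*\n', text.strip()) if seg]

# hypothetical input: one whitespace-only separator and one double blank line
text = "800004\nfirst line\n \n800005\nsecond line\n\n\n800006\nthird line\n"
segments = split_segments(text)
for x in segments:
    print('---> ' + x)
```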
answered 2013-06-05T13:27:29.020