python - Splitting a file of sentences line by line and extracting certain parameters with findall

Question

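I'm trying to go through a file of sentences and extract the uppercase letters from those sentences, line by line.

This is the data file I'm working with: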

the dog_SUBJ bit_VERB the cat_OBJ
the man_SUBJ ran_VERB
the cat_SUBJ ate_VERB the cheese_OBJ

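Essentially, I want the program to output 'SUBJ', 'VERB', and 'OBJ' for each line. However, with the script I have right now, the output for every line is all of the uppercase letters from every line in the file, not just the uppercase letters from that line.

This is the output I'm getting right now:

line 0: the dog_SUBJ bit_VERB the cat_OBJ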

['SUBJ', 'VERB', 'OBJ', 'SUBJ', 'VERB', 'SUBJ', 'VERB', 'OBJ']

line 1: the man_SUBJ ran_VERB

['SUBJ', 'VERB', 'OBJ', 'SUBJ', 'VERB', 'SUBJ', 'VERB', 'OBJ']

line 2: the cat_SUBJ ate_VERB the cheese_OBJ

['SUBJ', 'VERB', 'OBJ', 'SUBJ', 'VERB', 'SUBJ', 'VERB', 'OBJ']

For example, for line 0 I want the program to output 'SUBJ', 'VERB', 'OBJ', because that's what is in that line.

This is the script I'm currently using:

import re, sys
f = open('findallEX.txt', 'r')
lines = f.readlines()
ii=0

for l in lines:
    sys.stdout.write('line %s: %s' %(ii, l))
    ii = ii + 1
    results = []
    for i in lines:
        results += re.findall(r'[A-Z]+', i)

Thanks!

score 2 · Accepted Answer

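You're iterating over the list of lines twice for no reason. The inner loop rescans the entire file for every line of the outer loop, which is why each line prints the same list. Try this: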

import re

with open('findallEX.txt', 'r') as f:
    # enumerate supplies the line number; findall runs on the current line only
    for ii, line in enumerate(f):
        print('line %s: %s' % (ii, line))
        results = re.findall(r'[A-Z]+', line)
        print(results)

(I've also made things more Pythonic; you should use a context manager to open files (using with), and you should avoid managing loop variables by hand.)
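
As a rough sketch of the same per-line idea in a form that can be tried without the data file: the helper name extract_tags and the in-memory sample list below are made up for illustration and are not part of the original answer. The regex is the one from the question.

```python
import re

def extract_tags(lines):
    """Return one list of uppercase runs per input line (hypothetical helper)."""
    # findall is applied to each line separately, so results never mix lines
    return [re.findall(r'[A-Z]+', line) for line in lines]

# Stand-in for the contents of findallEX.txt (assumed from the question)
sample = [
    "the dog_SUBJ bit_VERB the cat_OBJ\n",
    "the man_SUBJ ran_VERB\n",
    "the cat_SUBJ ate_VERB the cheese_OBJ\n",
]

for ii, tags in enumerate(extract_tags(sample)):
    print('line %s: %s' % (ii, tags))
```

Passing an open file object instead of sample should work the same way, since iterating over a file yields its lines.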

score 0 · Answer

Without regular expressions:

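from itertools import chain, groupby

with open('text.txt') as f:
    # chain.from_iterable(f) yields the file one character at a time;
    # groupby collects consecutive uppercase characters into runs
    print([''.join(g) for k, g in
           groupby(chain.from_iterable(f), key=str.isupper) if k])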

['SUBJ', 'VERB', 'OBJ', 'SUBJ', 'VERB', 'SUBJ', 'VERB', 'OBJ']
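
Note that the list above covers the whole file at once, while the question asks for one list per line. A possible per-line variant of the same groupby approach is sketched below; the function name upper_runs is hypothetical and the sample string is taken from the question's data.

```python
from itertools import groupby

def upper_runs(line):
    """Collect runs of consecutive uppercase characters in one line (hypothetical helper)."""
    # groupby splits the line into alternating uppercase / non-uppercase runs;
    # only the runs whose key is True (uppercase) are kept
    return [''.join(g) for k, g in groupby(line, key=str.isupper) if k]

print(upper_runs("the dog_SUBJ bit_VERB the cat_OBJ\n"))
```

Calling upper_runs on each line while iterating over the open file would give the per-line grouping the question is after.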
