python - Python regex to print all sentences that contain two identified classes of markup

Question

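I want to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, and then print each of those whole sentences on its own line. Here is a sample of the code: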

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

The regex here grabs all sentences that contain both "wonderful" and "omaha", and returns:

Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.

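This is perfect, but what I really want is to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion", the regex fails to return any output. So the following code yields no result: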

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

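My question is: how can I modify my regex so that it grabs only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.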

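(For what it's worth, I'm also working on parsing my text with BeautifulSoup, but I wanted to give regular expressions one last shot before throwing in the towel.)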

score 1 · Accepted Answer

Your problem appears to be that your regex expects whitespace (\s) to follow the matched word, as seen here:

emotion(?=\s|\.|$)

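When the word is part of a tag, it is followed by a > rather than whitespace, a period, or the end of the string, so the lookahead fails and no match is found. To fix it, you can add the > after emotion, for example: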

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)

Upon testing, this seems to solve your problem. Make sure to treat "LOCATION" in the same way:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
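
To see the failing lookahead in isolation, here is a minimal sketch that tests only the word-plus-lookahead fragment against the sample text from the question. The variable names without_bracket and with_bracket are illustrative and not part of the original code.

```python
import re

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

# Original fragment: 'emotion' must be followed by whitespace, a period,
# or the end of the string. Inside a tag it is followed by '>' instead.
without_bracket = re.search(r'\bemotion(?=\s|\.|$)', text, flags=re.I)

# Adjusted fragment: consume the closing '>' first, then apply the lookahead.
with_bracket = re.search(r'\bemotion>(?=\s|\.|$)', text, flags=re.I)

print(without_bracket)        # None
print(with_bracket.group(0))  # emotion>
```

Note that \bemotion> matches the closing tag </emotion> as well as the opening one, which makes no difference when the goal is only to test whether the sentence contains the tag.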

score 0 · Accepted Answer

If I understand correctly, what you want to do is remove the <emotion> </emotion> <LOCATION></LOCATION> tags?

Well, if that is what you want to do, you can do it like this:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)

out.write(data + '\n')

out.close()
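
Removing the tags this way leaves behind the spaces that surrounded them, so the output contains doubled spaces and a space before the period. A minimal sketch of one way to tidy that up, assuming the spacing matters for your output; the helper name strip_tags is hypothetical and the punctuation set is only an example:

```python
import re

def strip_tags(xml):
    # Remove anything that looks like a tag.
    no_tags = re.sub(r'<.*?>', '', xml)
    # Collapse the runs of whitespace left where the tags used to be.
    collapsed = re.sub(r'\s+', ' ', no_tags)
    # Drop the space left in front of punctuation, e.g. "Omaha ." -> "Omaha."
    return re.sub(r'\s+([.,;:!?])', r'\1', collapsed).strip()

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>.")

print(strip_tags(text))
# Cello is a wonderful parakeet who lives in Omaha.
```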

score 0 · Accepted Answer

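I just discovered that the regex can be bypassed entirely. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it helps others who find themselves where I was, I'll post my code: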

# read in your file
f = open('sampleinput.txt', 'r')

# use read method to convert the read data object into string
readfile = f.read()

#########################
# now use the replace() method to clean data
#########################

# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')

# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')

# replace all ? with .
noquestions = nocommas.replace('?', '.')

# replace all ! with .
noexclamations = noquestions.replace('!', '.')

# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')

######################
# now use replace() to get rid of periods that don't end sentences
######################

# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr') 

#replace 'Mrs.' with 'Mrs' etc. 

cleantext = nomisters

#now, having cleaned the input, find all sentences that contain your two target words. To find markup, just replace "Toby" and "pipe" with <markupclassone> and <markupclasstwo>

periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
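
As the last comment suggests, the same loop works for markup if the two target words are swapped for the tags. A minimal sketch applied to the sample text from the question rather than a file; the function name sentences_with_both is hypothetical, and the sketch assumes every period ends a sentence, as the cleaning steps above are meant to guarantee:

```python
text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

def sentences_with_both(text, first, second):
    """Return the period-delimited sentences that contain both substrings."""
    matches = []
    for sentence in text.split('.'):
        if first in sentence and second in sentence:
            # split() drops the period, so put it back after trimming spaces.
            matches.append(sentence.strip() + '.')
    return matches

for sentence in sentences_with_both(text, '<emotion>', '<LOCATION>'):
    print(sentence)
# Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.
```

The substring test with in is case-sensitive, unlike the regex in the question, which uses re.I.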
