python - Python regex match that returns the complete sentence

Question

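I am trying to write a regular expression that finds the word "squirrel" in a list of sentences. The expression should return a list of the complete sentences that contain the word "squirrel".

Sentences containing the word "squirrel" may look like the following:

the squirrel has a long tail (.) say (.) long tail .
cats (a)n(d) squirrels (a)n(d) rabbits (a)n(d) bunnys (a)n(d) (.)
the squirrel+has a tail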

My regex currently looks like this:

word_only += re.findall('(.*?' + word + '?!\S)', sentence)  
word_only += re.findall('.*?' + word + '\S+', sentence)

But it only returns what comes before the word ("squirrel"), not what comes after it.

Any ideas? Thanks

score 4 · Accepted Answer

There is no need to use a regular expression here at all.

# The example string:
s = '''the squirrel has a long tail (.) say (.) long tail .
cats (a)n(d) squirrels (a)n(d) rabbits (a)n(d) bunnys (a)n(d) (.)
the squirrel+has a tail'''

sentencelist = s.split(".")  # split on periods
# keep only the pieces that contain "squirrel"
squirrel_sentences = [sentence for sentence in sentencelist if "squirrel" in sentence]
# If you don't find any squirrels, hold fire!

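On the other hand, if your text contains abbreviations or titles, this script will split it into too many sentences. When I had to solve a problem like this, I ended up using a regex such as \.\s+(?=[A-Z]) and splitting on the matches. That handles abbreviations such as NAACP, but not titles such as Mr. Smithers. In the end I built a dictionary of titles and replaced their periods before running the regex and counting. YMMV.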
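A minimal sketch of that approach, for illustration only. The TITLES tuple, the PLACEHOLDER string, and the split_sentences helper are hypothetical names that do not come from the answer, and the sample text is made up. Note that re.split consumes the period it splits on.

```python
import re

# Hypothetical list of titles; extend it for your own data.
TITLES = ("Mr.", "Mrs.", "Dr.")
# Assumed never to occur in the input text.
PLACEHOLDER = "<DOT>"

def split_sentences(text):
    # Hide the period of each known title so it cannot trigger a split.
    for title in TITLES:
        text = text.replace(title, title[:-1] + PLACEHOLDER)
    # Split on a period followed by whitespace and an uppercase letter.
    parts = re.split(r"\.\s+(?=[A-Z])", text)
    # Put the hidden periods back.
    return [part.replace(PLACEHOLDER, ".") for part in parts]

text = "Mr. Smithers saw a squirrel. The squirrel ran away. Dr. Who stayed."
squirrel_sentences = [s for s in split_sentences(text) if "squirrel" in s]
print(squirrel_sentences)
```

Plain str.replace also rewrites a title that appears at the end of a longer word, so a real title list may need a word-boundary check.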

score 0 · Accepted Answer

If I understand correctly, you have a list of strings, each containing one sentence.

import re

squirrel_sentences = []
for sentence in sentences:
    # re.search scans the whole sentence; re.match would only test its start
    if re.search(word, sentence):
        squirrel_sentences.append(sentence)

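If you have a single string containing multiple sentences, you can try matching against this regex, which looks for a run of characters from period to period that contains squirrel (it also handles the first and last sentences by using \A and \Z):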

(?:\A|(?<=\.))[^.]*squirrel[^.]*(?:\.|\Z)
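As a rough usage sketch, the pattern can be passed to re.findall with the literal periods written as \. so that they match only a period. The sample text and the SENTENCE_RE name are assumptions made for illustration. Because the pattern has no capturing groups, findall returns the whole matched sentences.

```python
import re

# Pattern from the answer, with the literal periods escaped.
SENTENCE_RE = re.compile(r"(?:\A|(?<=\.))[^.]*squirrel[^.]*(?:\.|\Z)")

# Made-up sample: the last sentence has no closing period.
text = "The squirrel has a long tail. Cats sleep all day. I fed a squirrel"

# Strip the whitespace left over after each preceding period.
matches = [m.strip() for m in SENTENCE_RE.findall(text)]
print(matches)
```

Since [^.] stops at every period, input such as the question's "(.)" markers will cut a sentence short at each of them.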
