You are right, the NLTK tokenizer really is what you should use in this situation, since it is robust enough to handle delimiting almost any sentence, including sentences that end with "quotes." You can do something like the following (paragraph is from a random generator):

Starting with:
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
The most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But with this method we actually have what is effectively a triply nested for loop. This is because we first check each sentence, then each highlight, and then each subsequence in the sentence for the highlight.
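If the highlights are guaranteed to be whole words rather than arbitrary substrings, one way to cut down the inner scanning (an alternative sketch, not part of the original approach) is to build a word set per sentence, so each highlight check becomes an O(1) membership test instead of a substring scan. Plain `split()` stands in for a real tokenizer here so the snippet runs without NLTK data:

```python
sentences = [
    "Chickens crushes a popular vet next to the eater.",
    "Coffee funds chickens.",
    "Chickens abides against an ineffective drill.",
]
highlights = {"vet", "funds"}

matches = []
for sentence in sentences:
    # One pass builds the word set; each highlight lookup is then O(1)
    # instead of a scan over the whole sentence.
    words = {word.strip(".,?!") for word in sentence.split()}
    if words & highlights:
        matches.append(sentence)
```

This only helps when a highlight is exactly one word; a multi-word highlight would still need the substring check.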
We can get better performance, since we know the start index of each highlight:
highlightIndices = [100, 169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    # Offset of this sentence within the original paragraph; searching from
    # subtractFromIndex accounts for the whitespace between sentences, which
    # sent_tokenize strips out.
    sentenceStart = paragraph.index(sentence, subtractFromIndex)
    for index in highlightIndices:
        if sentenceStart <= index < sentenceStart + len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex = sentenceStart + len(sentence)
Either way we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
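The hard-coded 100 and 169 only fit this particular paragraph; for arbitrary input the offsets can be computed up front with `str.find` (a small addition, not part of the original answer):

```python
paragraph = ("Chickens crushes a popular vet next to the eater. "
             "Coffee funds chickens.")
highlights = ["vet", "funds"]

# First occurrence of each highlight in the full paragraph; a highlight
# that occurs more than once would need str.find in a loop with a
# moving start offset.
highlightIndices = [paragraph.find(h) for h in highlights]
# highlightIndices == [27, 57]
```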