16

I am trying to extract all sentences that contain a specified word from a text.

import re
txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?

4

7 Answers

32

No regular expression needed:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]
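
The same idea can be wrapped in a small helper. This is only a sketch: the name sentences_containing is made up for illustration, and it assumes every sentence ends with a plain period. Like the one-liner above, it matches substrings, so "apples" counts as containing "apple".

```python
def sentences_containing(text, word):
    """Return the period-terminated sentences of `text` that contain `word` as a substring."""
    # str.split('.') drops the periods, so each one is added back; strip()
    # removes the leading space left over from the ". " separator.
    return [part.strip() + '.' for part in text.split('.') if word in part]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
# ['I like to eat apple.', "Let's go buy some apples."]
```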
answered 2013-04-16T09:07:14.570
20
In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]
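
The character class [^.] is what keeps each match inside one sentence: it matches anything except a period, so the pattern cannot run across a sentence boundary. A minimal sketch of the same pattern built for an arbitrary search term follows. The name find_sentences is hypothetical, and re.escape is an addition so that terms containing regex metacharacters are matched literally.

```python
import re

def find_sentences(text, term):
    # [^.]* matches any run of non-period characters, so a match never
    # crosses a sentence boundary; the final \. consumes the closing period.
    pattern = r"[^.]*" + re.escape(term) + r"[^.]*\."
    return [m.strip() for m in re.findall(pattern, text)]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(find_sentences(txt, "apple"))
# ['I like to eat apple.', "Let's go buy some apples."]
```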
answered 2013-04-16T09:09:20.783
9
In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

Note, however, that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

For larger strings the speed difference is smaller, but still significant:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
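
Timings like these can also be reproduced outside IPython with the standard timeit module. The sketch below is illustrative only: the function names and the repetition count are arbitrary choices, and the absolute numbers depend on the machine and Python version, so no output is shown.

```python
import re
import timeit

txt = ".I like to eat apple. Me too. Let's go buy some apples."

def with_regex(text):
    return re.findall(r'([^.]*apple[^.]*)', text)

def with_split(text):
    return [s + '.' for s in text.split('.') if 'apple' in s]

runs = 20000  # arbitrary repetition count
for func in (with_regex, with_split):
    # timeit.timeit calls the function `runs` times and returns total seconds.
    total = timeit.timeit(lambda: func(txt), number=runs)
    print("%s: %.3f us per call" % (func.__name__, total / runs * 1e6))
```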
answered 2013-04-16T09:07:00.647
4

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]
answered 2013-04-16T09:06:27.970
2
r"\."+".+"+"apple"+".+"+"\."

This line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

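Anyway, the problem with your regular expression is its greediness. By default, an x+ matches x as often as it can. So your .+ matches as many characters (any character) as possible, including dots and further occurrences of apple.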

What you want to use instead is a non-greedy expression; you can usually get one by adding a ? after the quantifier: .+?

This gives you the following result:

['.I like to eat apple. Me too.']

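As you can see, you no longer get both apple sentences, but you still get "Me too." as well. That is because the .+? after apple must consume at least one character, so it swallows the period that ends the first sentence, and the match cannot stop before the end of the following sentence.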
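
The difference between the greedy and the non-greedy quantifier can be checked by running both patterns on the sample text, with the leading period prepended as in the question:

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: each .+ grabs as much as it can, so the match runs to the last period.
print(re.findall(r"\..+apple.+\.", txt))
# [".I like to eat apple. Me too. Let's go buy some apples."]

# Non-greedy: .+? stops as early as possible, but must still consume at least
# one character after "apple", which here is the sentence's own period.
print(re.findall(r"\..+?apple.+?\.", txt))
# ['.I like to eat apple. Me too.']
```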

A regular expression that works is this: r'\.[^.]*?apple[^.]*?\.'

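Here you do not match just any character, but only characters that are not dots themselves. The * also allows matching no characters at all (because in the first sentence there are no non-dot characters after apple). Using that expression results in: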

['.I like to eat apple.', ". Let's go buy some apples."]
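
One caveat is worth checking (an observation added here, not a claim from the answer itself): because the pattern consumes a period at both ends, two matching sentences in a row have to share one period, and re.findall only returns non-overlapping matches. Dropping the leading \. avoids that and also removes the need to prepend a period. The sample string below is made up for illustration.

```python
import re

txt = ".I like apple. I hate apple."

# The period that ends the first sentence is consumed by the first match,
# so it is no longer available to start the second one.
print(re.findall(r"\.[^.]*?apple[^.]*?\.", txt))
# ['.I like apple.']

# Without the leading \. every sentence can match on its own.
print(re.findall(r"[^.]*apple[^.]*\.", txt))
# ['I like apple.', ' I hate apple.']
```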
answered 2013-04-16T09:11:56.327
0

Strictly speaking, the sample in the question is a case of "extract sentence containing substring", not "extract sentence containing word". The "extract sentence containing word" problem can be solved in Python as follows:

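A word can appear at the beginning, in the middle, or at the end of a sentence. Going beyond the example in the question, here is a general function for searching for a word in a sentence: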

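import re

def searchWordinSentence(word,sentence):
    # Match the word in the middle, at the start, or at the end of the sentence.
    w = re.escape(word)  # treat the word literally, not as a regex
    pattern = re.compile(' '+w+' |^'+w+' | '+w+'$')
    return re.search(pattern,sentence) is not None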

Restricted to the example in the question, it can be solved like this:

txt="I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([ t for t in txt.split('. ') if searchWordinSentence(word,t)])

The corresponding output is:

['I like to eat apple']
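
A common alternative for whole-word matching is the \b word-boundary anchor, which also copes with a word that is directly followed by punctuation. The sketch below is not the function from this answer; contains_word is a hypothetical name, and it assumes the search word starts and ends with a letter, digit, or underscore.

```python
import re

def contains_word(word, sentence):
    # \b matches between a word character and a non-word character (or the
    # string edge), so "apple" is found in "apple." but not inside "apples".
    return re.search(r"\b" + re.escape(word) + r"\b", sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
print([t for t in txt.split('. ') if contains_word("apple", t)])
# ['I like to eat apple']
```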
answered 2017-12-13T09:00:22.103
0
import nltk
search = "test"
text = "This is a test text! Best text ever. Cool"
contains = [s for s in nltk.sent_tokenize(text) if search in s]
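
nltk.sent_tokenize needs the NLTK package and its tokenizer data to be installed. Where that is not available, a rough stand-in is to split after sentence-ending punctuation with the standard re module. This is an assumption for illustration and far less robust than NLTK: it breaks on abbreviations such as "e.g.", for instance.

```python
import re

search = "test"
text = "This is a test text! Best text ever. Cool"

# Split on whitespace that follows '.', '!' or '?'; the lookbehind keeps the
# punctuation attached to the sentence it ends.
sentences = re.split(r"(?<=[.!?])\s+", text)
contains = [s for s in sentences if search in s]
print(contains)
# ['This is a test text!']
```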
answered 2021-10-27T12:07:02.663