23

I am designing a regular expression to split a given text into all of its actual words.


Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"


Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]



I came up with this regular expression:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

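After splitting with it in Python, the result contains None items and spaces.

How do I get rid of the None items? And why are the spaces not matched?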


Edit:
Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
Splitting on non-letters other than ' gives items like: ["'Where","you'"]
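The three attempts listed in the edit can be reproduced with a short sketch. The variable names are only illustrative, and the input is the example string from the question.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1) Split on whitespace: punctuation stays glued to the words.
by_space = s.split()

# 2) Split on runs of non-letters: apostrophes inside words break them apart.
by_non_letters = re.split(r"[^a-zA-Z]+", s)

# 3) Split on runs of characters that are neither letters nor ':
#    the quoting apostrophes stay attached to the words.
by_non_letters_keep_apostrophe = re.split(r"[^a-zA-Z']+", s)

print(by_space)
print(by_non_letters)
print(by_non_letters_keep_apostrophe)
```

None of the three produces the expected list, which is why a more selective pattern is needed.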

4 Answers

26

You can use string functions instead of a regex:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

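However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. So plain string operations fail at this point, and you need a finely tuned regex.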
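To make the limitation concrete, here is the same replace loop as a runnable sketch with the result kept in a variable (`words` is just an illustrative name):

```python
to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, "")

words = s.split()
print(words)
# The apostrophe in "John's" is kept, as wanted, but the quoting
# apostrophes survive too: the list contains "'Where" and "you'".
```

Adding ' to `to_be_removed` would not help either, because it would also turn John's into Johns.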

Edit: a simple regex can probably solve your problem:

(\w[\w']*)

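It captures every run that starts with a word character (\w matches letters, digits, and the underscore) and keeps capturing as long as the next character is an apostrophe or a word character.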

(\w[\w']*\w)

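This second regex is for a very specific situation: the first regex can capture words like you'. The second one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). But at that point another case comes up: with the second regex you cannot capture the apostrophe in Moss' mom. You have to decide whether to capture the trailing apostrophe in names that end with s and denote possession.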
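The trade-off between the two patterns can be seen side by side in a small sketch. The sample text and variable names are made up for illustration.

```python
import re

text = "Where are you' and Moss' mom"

# First pattern: a trailing apostrophe is kept.
with_trailing = re.findall(r"\w[\w']*", text)

# Second pattern: the match must end in a word character,
# so a trailing apostrophe is dropped.
inner_only = re.findall(r"\w[\w']*\w", text)

print(with_trailing)  # ['Where', 'are', "you'", 'and', "Moss'", 'mom']
print(inner_only)     # ['Where', 'are', 'you', 'and', 'Moss', 'mom']
```

The first pattern keeps the possessive in Moss' but also the stray quote in you'; the second drops both.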

Example:

import re

# Raw string so that \w reaches the regex engine unchanged.
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

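Update 2: I found a bug in my regex! Because it requires at least two word characters, it cannot capture a single letter followed by an apostrophe, like A'. The fixed, brand-new regex is here: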

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
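The effect of the added `|\w` alternative is easiest to see on the short tail of that input alone. This is a minimal sketch; the variable names are illustrative.

```python
import re

text = "'A a'"

# Without the fallback, a match needs at least two word characters,
# so one-letter words are lost entirely.
two_or_more = re.findall(r"\w[\w']*\w", text)

# With the fallback alternative, a lone word character also matches.
fixed = re.findall(r"\w[\w']*\w|\w", text)

print(two_or_more)  # []
print(fixed)        # ['A', 'a']
```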
answered 2012-10-03T09:25:37.610
8

You have too many capturing groups in your regex; make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

This returns only one empty element, the trailing '', because the string ends with a separator.
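As a sketch of why the capturing groups caused the None items: `re.split` also returns the text of every capturing group in the pattern, and a group that did not take part in a given match shows up as None. The toy pattern below is only for illustration. The second part filters out the remaining empty string with a list comprehension, which is one possible way to drop it.

```python
import re

# Toy example: two capturing groups, only one of which matches per separator.
demo = re.split(r"(,)|(;)", "a,b;c")
print(demo)  # ['a', ',', None, 'b', None, ';', 'c']

# The non-capturing pattern from this answer, with empty strings filtered out.
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
pattern = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"
words = [w for w in re.split(pattern, s) if w]
print(words)
```

This also accounts for the spaces in the original result: they were matched, but the capturing group around the last alternative put them back into the list.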

answered 2012-10-03T09:14:56.790
2

This regex allows only a single apostrophe at the end of the word, optionally followed by one more character:

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
answered 2013-05-02T22:32:03.687
0

I'm new to Python, but I think I've figured it out:

import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)

Result: ['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']

answered 2021-05-14T10:58:53.137