9

I'm using a regular expression in Python to split a chunk of text on periods and exclamation marks. But when I split it, None shows up in the result.

a = "This is my text...I want it to split by periods. I also want it to split \
by exclamation marks! Is that so much to ask?"

Here is my code:

re.split('((?<=\w)\.(?!\..))|(!)',a)

Note that I have (?<=\w)\.(?!\..) because I want it to skip ellipses. Even so, the code above spits out:

['This is my text...I want it to split by periods', '.', None, ' \
I also want it to split by exclamation marks', None, '!', \
' Is that so much to ask?']

As you can see, wherever there is a period or an exclamation mark, it adds a None to my list. Why does this happen, and how do I get rid of it?

4

3 Answers

14

Try the following:

re.split(r'((?<=\w)\.(?!\..)|!)', a)

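You are getting None because your pattern has two capturing groups, and every group is included in the re.split() result.

So any time you match a '.', the second capturing group is None, and any time you match a '!', the first capturing group is None.

Here is the result: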

['This is my text...I want it to split by periods',
 '.',
 ' I also want it to split by exclamation marks',
 '!',
 ' Is that so much to ask?']

If you don't want '.' and '!' included in the result, just remove the parentheses around the whole expression: r'(?<=\w)\.(?!\..)|!'
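To make the effect of the capturing groups concrete, here is a minimal sketch that runs the three variants discussed above on the text from the question. The variable names (two_groups, one_group, no_group) are only illustrative.

```python
import re

a = ("This is my text...I want it to split by periods. I also want it to split "
     "by exclamation marks! Is that so much to ask?")

# Two capturing groups: each match fills one group and leaves the other
# as None, and re.split() reports both groups for every match.
two_groups = re.split(r'((?<=\w)\.(?!\..))|(!)', a)

# One capturing group around the whole alternation: the delimiters are
# kept in the result, and there is no second group to come back as None.
one_group = re.split(r'((?<=\w)\.(?!\..)|!)', a)

# No capturing group at all: the delimiters are dropped from the result.
no_group = re.split(r'(?<=\w)\.(?!\..)|!', a)

print(two_groups)
print(one_group)
print(no_group)
```

The first list should contain two None entries, the second should match the result shown above, and the third should contain only the three pieces of text.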

answered 2012-07-03T22:49:56.117
2

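Here is a simpler expression (any period that is neither preceded nor followed by another period). To avoid None, the outer capturing group wraps the whole alternation (the | clause), not just its first part: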

re.split(r'((?<!\.)\.(?!\.)|!)', a)

# Result:
# ['This is my text...I want it to split by periods', 
#  '.', 
#  ' I also want it to split by exclamation marks', 
#  '!', 
#  ' Is that so much to ask?']
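The two period patterns are not interchangeable on every input. As a rough sketch, the snippet below compares them on a few made-up strings (the samples and the names word_before and lone_period are assumptions for illustration, not from the question); the capturing group is left out so only the text pieces are returned.

```python
import re

# Period that follows a word character and is not followed by
# another period plus one more character.
word_before = r'(?<=\w)\.(?!\..)|!'

# Period with no period directly before or after it.
lone_period = r'(?<!\.)\.(?!\.)|!'

# Hypothetical samples: a normal sentence, a period after a space,
# and a two-dot run at the very end of the string.
samples = ["Wait...what. Done!", "Stop . Go", "Hmm.."]

for text in samples:
    print(repr(text))
    print("  word_before:", re.split(word_before, text))
    print("  lone_period:", re.split(lone_period, text))
```

On these samples the patterns should agree on the first string and differ on the other two: the \w lookbehind ignores a period that follows a space, and (?!\..) needs two more characters to reject a match, so it does not protect a two-dot run at the end of the string.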
answered 2012-07-03T22:54:29.280
1

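This happens because the pattern contains two capturing groups: for every match, the group that did not take part in the match is returned as None.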

You can use filter to remove these None values:

>>> import re
>>> a = "This is my text...I want it to split by periods. I also want it to split \
by exclamation marks! Is that so much to ask?"

>>> list(filter(lambda x: x is not None, re.split(r'((?<=\w)\.(?!\..))|(!)', a)))

['This is my text...I want it to split by periods', '.', ' I also want it to split by exclamation marks', '!', ' Is that so much to ask?']
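One detail worth checking when filtering: a None test and a truthiness test behave differently if the text ends with a delimiter, because re.split() then returns a trailing empty string. A small sketch, using a made-up sample string (text, not_none, and truthy are illustrative names):

```python
import re

pattern = r'((?<=\w)\.(?!\..))|(!)'
text = "Hi! Bye!"  # hypothetical sample that ends with a delimiter

parts = re.split(pattern, text)

# Drops only None; the trailing empty string is kept.
not_none = [p for p in parts if p is not None]

# filter(None, ...) drops every falsy item, so None and '' both go.
truthy = list(filter(None, parts))

print(parts)
print(not_none)
print(truthy)
```

Which of the two is appropriate depends on whether empty pieces carry meaning for the caller.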
answered 2012-07-03T22:48:10.060