python - Python Regex - Remove special characters but preserve apostrophes

Question

I am trying to remove all special characters from some text. Here is my regex:

import re
pattern = re.compile(r'[\W_]+', re.UNICODE)
words = str(pattern.sub(' ', words))

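Super simple, but unfortunately it causes problems with apostrophes (single quotes). For example, if I have the word "doesn't", this code returns "doesn".

Is there any way to adapt this regex so that it does not remove apostrophes in cases like this?

Edit: this is what I am after: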

doesn't this mean it -technically- works?

should become:

doesn't this mean it technically works

score 12 · Accepted Answer

Like this?

>>> import re
>>> pattern = re.compile(r"[^\w']")
>>> pattern.sub(' ', "doesn't it rain today?")
"doesn't it rain today "

If underscores should also be filtered out:

>>> re.compile(r"[^\w']|_").sub(" ", "doesn't this _technically_ means it works? naïve I am ...")
"doesn't this  technically  means it works  naïve I am    "

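As a minimal sketch (not part of the answer above), the underscore-filtering pattern can be wrapped in a small helper that also collapses runs of separators into one space and trims the ends, which should bring the asker's sample to the requested form. The name strip_special is hypothetical, and Python 3 is assumed, where \w matches Unicode letters by default.

```python
import re

# Anything that is not a word character or an apostrophe, plus the underscore
# (which \w would otherwise keep). The trailing + collapses runs into one match.
_SPECIAL = re.compile(r"(?:[^\w']|_)+")


def strip_special(text):
    """Replace each run of special characters with one space and trim the ends.

    Hypothetical helper, shown only to illustrate the pattern.
    """
    return _SPECIAL.sub(" ", text).strip()


print(strip_special("doesn't this mean it -technically- works?"))
# doesn't this mean it technically works
```

Replacing with a space rather than an empty string keeps words that were separated only by punctuation from being glued together.
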
score 1 · Accepted Answer

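I was able to parse your sample into a list of words using this regex: [a-z]*'?[a-z]+.

Then you can just join the elements of the list back together with spaces.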
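
A rough sketch of that approach, assuming re.findall is used to collect the matches; the helper name words_only is made up for illustration:

```python
import re

# Lowercase letters with at most one apostrophe inside or in front of them.
WORD = re.compile(r"[a-z]*'?[a-z]+")


def words_only(text):
    """Extract the word-like tokens and join them back with single spaces."""
    return " ".join(WORD.findall(text))


print(words_only("doesn't this mean it -technically- works?"))
# doesn't this mean it technically works
```

As written, the pattern only covers lowercase ASCII letters, so capitals, digits, and accented characters would be dropped. Passing re.IGNORECASE to re.compile is one possible way to keep capitals, though that goes beyond what the answer states.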

score 0 · Accepted Answer

How about

re.sub(r"[^\w' ]", "", "doesn't this mean it -technically- works?")

answered 2012-07-09T21:44:35.570
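
For comparison, here is a small sketch of what deleting (rather than replacing with a space) does. The second input, "well-known", is my own example and not from the thread:

```python
import re

text = "doesn't this mean it -technically- works?"

# Delete everything except word characters, apostrophes, and spaces.
deleted = re.sub(r"[^\w' ]", "", text)
print(deleted)  # doesn't this mean it technically works

# Deleting joins words that were separated only by punctuation...
joined = re.sub(r"[^\w' ]", "", "well-known")
print(joined)  # wellknown

# ...whereas substituting a space keeps them apart.
spaced = re.sub(r"[^\w']", " ", "well-known")
print(spaced)  # well known
```

Which behaviour is preferable probably depends on whether hyphenated terms should count as one word or two.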

score 0 · Accepted Answer

How about this: ([^\w']|_)+

Note that this will not work well for cases like:

doesn't this mean it 'technically' works?

where the quotes around 'technically' are kept, which might not be what you are after.
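
One possible way around that, sketched here as an assumption rather than something the answer proposes, is to treat an apostrophe as removable unless it has a word character on both sides. The helper name strip_special_keep_inner is hypothetical:

```python
import re

# Matches runs of: an apostrophe not preceded by a word character,
# an apostrophe not followed by a word character,
# any other non-word character, or an underscore.
EDGE_OR_SPECIAL = re.compile(r"(?:(?<!\w)'|'(?!\w)|[^\w']|_)+")


def strip_special_keep_inner(text):
    """Keep apostrophes only when they sit between two word characters."""
    return EDGE_OR_SPECIAL.sub(" ", text).strip()


sample = "doesn't this mean it 'technically' works?"

# The plain pattern from the answer leaves the quoting apostrophes in place.
print(re.sub(r"([^\w']|_)+", " ", sample).strip())
# doesn't this mean it 'technically' works

print(strip_special_keep_inner(sample))
# doesn't this mean it technically works
```

A trade-off of this sketch is that trailing apostrophes in plural possessives (for example dogs') would be stripped as well.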
