python - Python regex with ØÆÅ letters

Question

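I'm new to Python, so this may be an easy one. I'm trying to remove every # and every digit, and if the same letter is repeated more than twice in a row, I need to reduce it to just two letters. This works perfectly, except for ØÆÅ.

Any ideas how this can be done with the ØÆÅ letters?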

#!/usr/bin/python 
# -*- coding: utf-8 -*-

import math, re, sys, os, codecs
reload(sys)
sys.setdefaultencoding('utf-8')
text = "ån9d ånd ååååånd d9d flllllløde... :)asd "

# Remove '#' and digits, then collapse repeated letters to two
text = re.sub(r'#', "", text)
text = re.sub(r"\d", "", text)
text = re.sub(r'(\w)\1+', r'\1\1', text)
print "Phone Num : "+ text

The result I get now is:

Phone Num : ånd ånd ååååånd dd flløde... :)asd

What I want is:

Phone Num : ånd ånd åånd dd flløde... :)asd

score 5 · Accepted Answer

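You need to use Unicode values, not byte strings. UTF-8 encodes å as two bytes, and when running in the default non-Unicode-aware mode, the regular expression \w matches only ASCII letters, digits, and the underscore.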
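As a quick check of the two-byte claim, here is a minimal sketch. It assumes Python 3 (where byte strings are written b'...' and text strings are Unicode by default) rather than the Python 2 used in the question, and the variable names are illustrative only:

```python
import re

letter = "\u00e5"                 # the single character å
encoded = letter.encode("utf-8")  # the same letter as a UTF-8 byte string

print(len(letter))   # 1 character
print(encoded)       # b'\xc3\xa5', i.e. 2 bytes

# A bytes pattern is ASCII-only: neither of the two bytes counts as \w.
print(re.findall(rb"\w", encoded))  # []

# On the Unicode string, \w sees one alphanumeric character.
print(re.findall(r"\w", letter))    # ['å']
```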

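From the re module documentation on \w:

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.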
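The ASCII-only and Unicode cases from that passage can be compared side by side. This is a sketch assuming Python 3, where re.ASCII reproduces the [a-zA-Z0-9_] behaviour and Unicode matching is already the default for str patterns; the sample string and variable names are illustrative only:

```python
import re

sample = "ån_9 ø"

# ASCII-only matching: equivalent to the set [a-zA-Z0-9_].
ascii_words = re.findall(r"\w", sample, flags=re.ASCII)

# Unicode-aware matching (the default for str patterns in Python 3).
unicode_words = re.findall(r"\w", sample, flags=re.UNICODE)

print(ascii_words)    # ['n', '_', '9']
print(unicode_words)  # ['å', 'n', '_', '9', 'ø']
```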

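Unfortunately, even if you switch to using Unicode values properly (with a u'' literal or by decoding your source data to unicode values), use a Unicode regular expression (re.sub(ur'...')), and pass the re.UNICODE flag to make \w match Unicode alphanumeric characters, the Python re module still has only limited support for Unicode matching: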

>>> print re.sub(ur'(\w)\1+', r'\1\1', text, re.UNICODE)
ånd ånd ååååånd dd flløde... :)asd

because å is not recognized as alphanumeric:

>>> print re.sub(ur'\w', '', text, re.UNICODE)
å å ååååå  ø... :)

The solution is to use the external regex library, a version of the re library that adds proper, full Unicode support:

>>> import regex
>>> print regex.sub(ur'(\w)\1+', r'\1\1', text, re.UNICODE)
ånd ånd åånd dd flløde... :)asd

That module can do much more than recognize additional alphanumeric characters in Unicode values; see the linked package page for details.
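One detail worth checking in the re.sub calls shown in the transcript: the fourth positional parameter of re.sub is count, not flags, so re.UNICODE passed in that position is read as a replacement limit and the flag itself is never set. That may account for å not being treated as alphanumeric there. The sketch below assumes Python 3, where Unicode matching is already the default, so it uses re.ASCII to make the difference visible; the variable names are illustrative only:

```python
import re

text = "ååååånd"
pattern = r"(\w)\1+"

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# A flag passed as the fourth positional argument is therefore read as `count`.
# Written out explicitly, that is the same as this call:
flag_as_count = re.sub(pattern, r"\1\1", text, count=int(re.ASCII))

# Passing the flag by keyword is what actually changes how \w matches.
flag_as_flag = re.sub(pattern, r"\1\1", text, flags=re.ASCII)

print(flag_as_count)  # åånd     (flag not applied, \w is still Unicode-aware)
print(flag_as_flag)   # ååååånd  (ASCII-only \w no longer matches å)
```

In Python 2.7 the keyword form would be re.sub(ur'(\w)\1+', r'\1\1', text, flags=re.UNICODE), applied to a unicode value.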

score 0 · Accepted Answer

Change:

text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd "

and

text = re.sub(r'(\w)\1+', r'\1\1', text)

Full solution

import math, re, sys, os, codecs
reload(sys)
sys.setdefaultencoding('utf-8')
text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd "

# Remove '#' and digits, then collapse repeated characters to two
text = re.sub(r'#', "", text)
text = re.sub(r"\d", "", text)
text = re.sub(r'(\w)\1+', r'\1\1', text)
text = re.sub(r'(\W)\1+', r'\1\1', text)
print "1: "+ text

Prints:

1: ånd ånd åånd dd flløde.. :)asd
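For readers on Python 3, where str is Unicode and \w is Unicode-aware by default, the same steps can be sketched as follows. The function name clean_text is hypothetical, and this version leaves out the (\W) step so that the "..." stays intact, matching the output the question asks for:

```python
import re


def clean_text(text):
    """Drop '#' and digits, then cap runs of a repeated word character at two."""
    text = re.sub(r"#", "", text)
    text = re.sub(r"\d", "", text)
    # str patterns are Unicode-aware in Python 3, so \w also matches å, ø, æ.
    return re.sub(r"(\w)\1+", r"\1\1", text)


print(clean_text("ån9d ånd ååååånd d9d flllllløde... :)asd "))
# ånd ånd åånd dd flløde... :)asd
```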
