python - Unicode re.sub() doesn't work with \g<0> (group 0)

Question

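Why doesn't \g<0> work with unicode regular expressions?

When I use \g<0> with a plain (byte) string regex to insert a space before and after the matched group, it works: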

>>> punct = """,.:;!@#$%^&*(){}{}|\/?><"'"""
>>> rx = re.compile('[%s]' % re.escape(punct))
>>> text = '''"anständig"'''
>>> rx.sub(r" \g<0> ",text)
' " anst\xc3\xa4ndig " '
>>> print rx.sub(r" \g<0> ",text)
 " anständig "

But with a unicode regex, no spaces are added:

>>> punct = u""",–−—’‘‚”“‟„!£"%$'&)(+*-€/.±°´·¸;:=<?>@§#¡•[˚]»_^`≤…\«¿¨{}|"""
>>> rx = re.compile("["+"".join(punct)+"]", re.UNICODE)
>>> text = """„anständig“"""
>>> rx.sub(ur" \g<0> ", text)
'\xe2\x80\x9eanst\xc3\xa4ndig\xe2\x80\x9c'
>>> print rx.sub(ur" \g<0> ", text)
„anständig“

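1. How can I get \g to work in unicode regular expressions?
2. If (1) is not possible, how can I get the unicode regex to insert a space before and after each character in punct?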

score 1 · Accepted Answer

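I think you have two errors. First, you are not escaping punct with re.escape as you did in the first example, and it contains the characters [ and ], which need to be escaped. Second, the text variable is not unicode. A working example: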

>>> import re
>>> punct = re.escape(u""",–−—’‘‚”“‟„!£"%$'&)(+*-€/.±°´·¸;:=<?>@§#¡•[˚]»_^`≤…\«¿¨{}|""")
>>> rx = re.compile("["+"".join(punct)+"]", re.UNICODE)
>>> text = u"""„anständig“"""
>>> print rx.sub(ur" \g<0> ", text)
 „ anständig “
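
For readers on Python 3, the same fix might look roughly like the sketch below. This is not part of the original answer: it assumes Python 3, where every str is already Unicode, and the names PUNCT, PUNCT_RX and pad_punctuation are made up for illustration. The punctuation set is the one from the question, with the backslash doubled so it stays a literal backslash.

```python
import re

# Punctuation set from the question. The backslash is doubled so that
# Python 3 does not treat "\«" as an (invalid) string escape.
PUNCT = """,–−—’‘‚”“‟„!£"%$'&)(+*-€/.±°´·¸;:=<?>@§#¡•[˚]»_^`≤…\\«¿¨{}|"""

# re.escape() neutralises "]", "^", "-" and "\" so that every character
# ends up as a literal member of the character class.
PUNCT_RX = re.compile("[" + re.escape(PUNCT) + "]")


def pad_punctuation(text):
    """Surround every punctuation character in text with single spaces."""
    # \g<0> refers to the whole match, i.e. the punctuation character itself.
    return PUNCT_RX.sub(r" \g<0> ", text)


print(pad_punctuation("„anständig“"))
```

The u and ur prefixes from the Python 2 session are dropped here; ur is not valid syntax in Python 3, and a plain raw string (r" \g<0> ") does the same job for the replacement template.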
