python - Does Python re (regex) have an alternative to \u Unicode escape sequences?

Question

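Python treats \uxxxx as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module does not treat \uxxxx as a Unicode character. Example: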

import re

codepoint = 0x2014 # Say I got this dynamically from somewhere

test = u"This string ends with \u2014"
pattern = r"\u%04x$" % codepoint
assert pattern[-5:] == "2014$" # Ends with an escape sequence for U+2014
assert re.search(pattern, test) is None # No match, although U+2014 is there (bad)
assert re.search(pattern, "u2014") is not None # Matches the literal text "u2014" (bad)

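Obviously, if you are able to specify your regex pattern as a string literal, then you get the same effect as if the regex engine itself understood \uxxxx escapes: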

test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert pattern[:-1] == u"\u2014" # Ends with actual unicode char U+2014
assert re.search(pattern, test) is not None
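
The difference between the two patterns comes down to which characters the pattern string actually holds by the time re sees it. A minimal sketch, written for Python 3 as an assumption (the question targets Python 2.7, where the corresponding literals would be r"..." and u"..."); the variable names are illustrative only:

```python
import re

raw_pattern = r"\u2014$"   # raw literal: backslash, 'u', '2', '0', '1', '4', '$'
real_pattern = "\u2014$"   # normal literal: the single character U+2014, then '$'

print(len(raw_pattern))    # 7
print(len(real_pattern))   # 2

test = "This string ends with \u2014"

# A pattern that already contains the real character needs no help from re.
print(re.search(real_pattern, test) is not None)  # True
```

From Python 3.3 onward the re module interprets \uXXXX in patterns itself, so raw_pattern would also match there; the failure described in the question is specific to Python 2's re.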

But what if you need to build your pattern dynamically?

score 4 · Accepted Answer

Use the unichr() function to create a Unicode character from the integer code point (for example 0x2014):

pattern = u"%s$" % unichr(codepoint)
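
For reference, a minimal sketch of the same idea on Python 3, where chr() takes over the role of unichr(). The helper name pattern_ending_with is hypothetical, and the re.escape() call is an addition that is not part of the answer above; it guards against code points that happen to be regex metacharacters.

```python
import re

def pattern_ending_with(codepoint):
    """Build a pattern matching strings that end with the given code point.

    Hypothetical helper for illustration; `codepoint` must be an int.
    """
    # chr() turns the integer into the actual character; re.escape() keeps
    # characters such as '$' or '.' from being read as regex syntax.
    return re.escape(chr(codepoint)) + "$"

test = "This string ends with \u2014"
print(re.search(pattern_ending_with(0x2014), test) is not None)     # True
print(re.search(pattern_ending_with(0x2014), "u2014") is not None)  # False
```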

Answered 2013-05-14T11:17:21.560

score 1 · Accepted Answer

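One possibility is, rather than calling the re methods directly, to wrap them in something that understands \u escapes on their behalf. Something like this: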

import re

def my_re_search(pattern, s):
    return re.search(unicode_unescape(pattern), s)

def unicode_unescape(s):
    r"""
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        # Decode only the matched \uxxxx sequence (Python 2 str.decode)
        escape_seq = matchObj.group(0)
        return escape_seq.decode('unicode_escape')
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)

An example of it working:

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac

>>> path = ur"C:\reports\twenty\u20acplan.txt"
>>> print path
C:\reports\twenty€plan.txt

>>> # Underlying re.search method fails to find a match
>>> re.search(pat, path) is not None
False

>>> # Vs this:
>>> my_re_search(pat, path) is not None
True
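
The wrapper above relies on Python 2's str.decode(). As a rough sketch of how the same approach might look on Python 3, where str has no decode() method, each matched escape can be converted with chr(int(..., 16)) instead. The module-level name _U_ESCAPE is an assumption made for this illustration; the two function names are reused from the answer.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unicode_unescape(pattern):
    r"""Replace each \uXXXX escape in `pattern` with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)

def my_re_search(pattern, s):
    return re.search(unicode_unescape(pattern), s)

pat = r"C:\\.*\u20ac"                        # U+20AC is the euro sign
path = "C:\\reports\\twenty\u20acplan.txt"

print(unicode_unescape(pat) == "C:\\\\.*\u20ac")  # True: only \u20ac was replaced
print(my_re_search(pat, path) is not None)        # True
```

Like the original, this sketch does not check whether the backslash before the u is itself escaped, so a pattern such as r"\\u20ac" (a literal backslash followed by the text u20ac) would be rewritten as well.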

Thanks to "Process escape sequences in a string in Python" for pointing out the decode("unicode_escape") idea.

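Note, however, that you can't just pass the whole pattern through decode("unicode_escape"). It will work some of the time (because most regex special characters don't change their meaning when you put a backslash in front of them), but it won't work in general. For example, here decode("unicode_escape") changes the meaning of the regex: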

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat # Asks for a literal backslash
C:\\.*\u20ac

>>> pat_revised = pat.decode("unicode_escape")
>>> print pat_revised # Asks for a literal period (without a backslash)
C:\.*€
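
The effect on matching can be checked directly. A small sketch under the assumption of Python 3, where the pattern is first encoded to bytes so that the unicode_escape codec can decode it; the sample strings are made up for illustration:

```python
import re

pat = r"C:\\.*\u20ac"
path = "C:\\reports\\twenty\u20acplan.txt"

# Decoding the whole pattern also collapses the escaped backslash.
decoded = pat.encode("ascii").decode("unicode_escape")
print(decoded == "C:\\.*\u20ac")  # True: backslash-dot now means a literal period

print(re.search(decoded, path) is None)               # True: the path no longer matches
print(re.search(decoded, "C:...\u20ac") is not None)  # True: it matches periods instead
```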
