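Python treats \uxxxx as a Unicode character escape inside a string literal (for example, u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module does not treat \uxxxx as a Unicode character. Example: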
import re

codepoint = 2014  # Say I got this dynamically from somewhere
test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert pattern[-5:] == "2014$"  # Ends with an escape sequence for U+2014
assert re.search(pattern, test) is not None  # Failure -- No match (bad)
assert re.search(pattern, "u2014") is not None  # Success -- This matches (bad)
Obviously, if you are able to specify your regex pattern as a string literal, you get the same effect as if the regex engine itself understood \uxxxx escapes:
test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert pattern[:-1] == u"\u2014"  # Ends with actual unicode char U+2014
assert re.search(pattern, test) is not None
But what if you need to build your pattern dynamically?
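For illustration, here is a minimal sketch of one possible approach: convert the code point to the actual character before it goes into the pattern, so the regex engine never has to interpret a \uxxxx escape. The helper name pattern_for_codepoint and the assumption that the code point arrives as a string of hex digits are hypothetical and not part of the code above.

```python
import re

try:
    unichr  # Python 2: unichr() builds a unicode character from an integer
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character


def pattern_for_codepoint(hex_digits):
    """Hypothetical helper: regex matching U+<hex_digits> at the end of a string."""
    char = unichr(int(hex_digits, 16))  # e.g. "2014" -> the character U+2014
    # re.escape keeps the character literal even if it is a regex metacharacter
    return re.escape(char) + u"$"


test = u"This string ends with \u2014"
pattern = pattern_for_codepoint("2014")  # assumed to be obtained dynamically
match = re.search(pattern, test)
```

With this sketch, match should be a match object for the trailing U+2014, and the same pattern should no longer match the plain text "u2014". On a narrow Python 2 build, unichr may reject code points above U+FFFF, so this is only a starting point.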