python - Regex matching Unicode characters behaves differently on different strings

Question

OK, I'm doing some Unicode regex matching on a few strings.

These are the strings in question. They are not two lines of one string, but two separate strings.

\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director

\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2

I'm using this regex to extract the titles inside the Unicode quotes.

regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)

Calling regex.findall() gives me

['u2018Mama\\u2019']

and

['u2018Glee\\u2019', 'u2018Arrow\\u2019']

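This raises two questions I can't figure out. First, why isn't \u2018 returned in full? Where did the leading \ go?

Second, what is the difference between the two strings? I can't see it. In the end I replaced \u2018 and \u2019 with ' and used this regex instead.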

re.compile("'[^']*'")

It matches in both strings. What is the difference here? What am I missing about Unicode regexes?

Thanks in advance.

score 1 · Accepted Answer

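# coding=utf8
# Python 2 code (print statement, ur"" literals).

import re

# A unicode literal, so \u2018 and \u2019 become the real quote characters.
s=u'''\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2'''
print s
# Note: inside [...] the text (?!‘$) is a plain set of characters, not a lookahead.
regex = re.compile(ur"‘[^(?!‘$)]*’",re.UNICODE)
m = regex.findall(s)
print m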

[u'\u2018Mummy\u2019', u'\u2018Mama\u2019', u'\u2018Glee\u2019', u'\u2018Arrow\u2019']
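
The answer above does not say why the original attempt lost 'Mummy' or the leading backslash. A likely explanation, sketched here in Python 3: the input held the literal six-character text \u2018 rather than the real quote character, and Python 2's re module did not understand \u escapes, so \u in the pattern matched a plain u. That drops the backslash from each match, and because (?!\u2018$) sits inside [^...], the letter u becomes an excluded character, which rules out "Mummy" but not "Mama", "Glee" or "Arrow". The names legacy_pattern, decode_escapes and find_titles are made up for illustration, and legacy_pattern is a hand-written equivalent of how Python 2 would have read the original pattern, which is an assumption rather than something the thread confirms.

```python
import re

# Raw strings: each \u2018 below is six literal characters, not a quote mark.
raw_one = r"\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director"
raw_two = (r"\u2018Glee\u2019 Star Grant Gustin to Play The Flash "
           r"in \u2018Arrow\u2019 Season 2")

# Assumed equivalent of the question's pattern as Python 2 interpreted it:
# "\u" is just "u", and the bracket excludes the characters ( ? ! u 2 0 1 8 $ ).
legacy_pattern = re.compile(r"u2018[^(?!u2018$)]*u2019")

def decode_escapes(text):
    """Turn literal \\uXXXX sequences into real characters (ASCII input only)."""
    return text.encode("ascii").decode("unicode_escape")

def find_titles(text):
    """Return the text between a left and a right single quotation mark."""
    return re.findall("\u2018([^\u2019]*)\u2019", text)

# "Mummy" contains a "u", so the excluded set stops the match early.
print(legacy_pattern.findall(raw_one))
print(legacy_pattern.findall(raw_two))

# After decoding, a plain negated class on the closing quote is enough.
print(find_titles(decode_escapes(raw_one)))
print(find_titles(decode_escapes(raw_two)))
```

If the strings come from JSON, parsing them with a JSON decoder instead of decode_escapes would probably be the more robust way to get real quote characters before matching.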
