python - How to match emoji in a sentence with a regular expression

Question

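I am using Python to process sentences from Weibo (a Twitter-like service in China). Some sentences contain emoji whose Unicode code points look like \ue317. To process a sentence I need to encode it as GBK, as shown below: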

 string1_gbk = string1.decode('utf-8').encode('gb2312')

This raises UnicodeEncodeError: 'gbk' codec can't encode character u'\ue317'

I tried \\ue[0-9a-zA-Z]{3}, but it did not work. How can I match these emoji in a sentence?

score 4 · Accepted Answer

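The six-character text \ue317 is not a substring of u"asdasd \ue317 asad". The escape is only the human-readable representation of a single Unicode character, so a pattern that looks for a literal backslash followed by ue317 cannot match it. Such a pattern matches only the escaped text form, for example repr(u'\ue317').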
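A minimal sketch of the point above, written for Python 3 (where a plain string literal behaves like a Python 2 u"..." literal). It assumes the emoji of interest sit in the Basic Multilingual Plane Private Use Area, U+E000 to U+F8FF, which is where \ue317 falls; the name PUA_PATTERN is made up for illustration.

```python
import re

# In a Python 3 str, \ue317 is ONE character, not six.
text = "asdasd \ue317 asad"

# Assumption: the unwanted emoji are Private Use Area code points.
PUA_PATTERN = re.compile("[\ue000-\uf8ff]")

# Matching the character itself works.
found = PUA_PATTERN.findall(text)

# The pattern from the question looks for a literal backslash,
# which the string does not contain, so it finds nothing.
escaped_hits = re.findall(r"\\ue[0-9a-zA-Z]{3}", text)

# It only matches the escaped representation of the string.
repr_hits = re.findall(r"\\ue[0-9a-zA-Z]{3}", ascii(text))

# Removing the emoji before encoding.
stripped = PUA_PATTERN.sub("", text)

print(found, escaped_hits, repr_hits, repr(stripped))
```

If the emoji in your data lie outside that range, the character class would need to be widened accordingly.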

score 2 · Accepted Answer

Try:

string1_gbk = string1.decode('utf-8').encode('gb2312', 'replace')

This should output ? in place of those emoji.
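As a rough illustration of how the error handlers differ, here is a Python 3 sketch (so the .decode('utf-8') step from the question is not needed, since the text is already a str). The sample sentence is invented for the example.

```python
# Sample text with one Private Use Area emoji code point.
text = "你好 \ue317 weibo"

# The default handler ('strict') raises on the emoji.
try:
    text.encode("gb2312")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes ? for each character that cannot be encoded.
replaced = text.encode("gb2312", "replace")

# 'ignore' drops such characters silently.
ignored = text.encode("gb2312", "ignore")

print(strict_failed)
print(replaced.decode("gb2312"))
print(ignored.decode("gb2312"))
```

Both handlers lose the emoji for good, so they fit best when the emoji carry no information you need later.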

score 1 · Accepted Answer

This may be because the backslash is a special escape character in regular expression syntax. The following worked for me:

>>> import re
>>> # raw string: each \ueXXX stays as six literal characters
>>> test_str = r'blah blah blah \ue317 blah blah \ueaa2 blah ue317'
>>> re.findall(r'\\ue[0-9A-Za-z]{3}', test_str)
['\\ue317', '\\ueaa2']

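Note that it does not falsely match the ue317 at the end, which has no backslash in front of it. Obviously, use re.sub() if you want to replace those strings.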
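A short sketch of the re.sub() variant, assuming the input holds the escapes as literal text (a backslash followed by ue and three more characters). The name ESCAPE_PATTERN is hypothetical, and the character class is narrowed to hex digits as a design choice, not something the original pattern required.

```python
import re

# Matches a literal backslash, then "ue", then three hex digits.
ESCAPE_PATTERN = re.compile(r"\\ue[0-9A-Fa-f]{3}")

# Raw string, so the backslashes stay literal characters.
test_str = r"blah blah blah \ue317 blah blah \ueaa2 blah ue317"

# Delete every escaped emoji; the bare "ue317" is left alone.
cleaned = ESCAPE_PATTERN.sub("", test_str)

print(repr(cleaned))
```

Passing a different replacement string as the first argument to sub(), such as "?", would mark the removed positions instead of deleting them.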
