python - Warning raised by inserting 4-byte unicode into mysql

Question

Take a look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

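The string '\xF0\x9F\x91\x8A' is actually a 4-byte unicode character: u'\U0001f62a'. The MySQL character set is utf-8, but inserting a 4-byte unicode character truncates the inserted string. I googled the problem and found that MySQL versions below 5.5.3 don't support 4-byte unicode, and unfortunately mine is 5.5.224. I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte unicode characters in Python. I tried a regular expression but it failed. Any help?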

score 10 · Accepted Answer

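If MySQL cannot handle UTF-8 sequences of 4 bytes or more, then you'll have to filter out all unicode characters at or above codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.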
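To make the threshold concrete, here is a small sketch that prints how many bytes a few sample characters take in UTF-8. The helper name `utf8_length` is made up for illustration, and the loop assumes Python 3 or a UCS-4 build of Python 2 (so that `ord()` works on the last two samples).

```python
def utf8_length(char):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(char.encode('utf-8'))

# Samples on both sides of the U+10000 threshold.
samples = (u'A', u'\u00e9', u'\u20ac', u'\uffff', u'\U00010000', u'\U0001f62a')
for sample in samples:
    print('U+%04X -> %d byte(s)' % (ord(sample), utf8_length(sample)))
```

Everything up to U+FFFF fits in at most 3 bytes; U+10000 and above need 4.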

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
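Applied to the question's pipeline, the same pattern could clean each field before it is handed to `cursor.execute()`. This is only a sketch: `clean_params` is a hypothetical helper, the default key names are taken from the `execute()` call in the question, and it assumes every field is already a unicode string on Python 3 or a UCS-4 build.

```python
import re

# Matches every code point that needs 4 bytes in UTF-8.
HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')

def clean_params(item, keys=('topic', 'url', 'content')):
    """Return the selected item fields as a tuple with 4-byte characters removed."""
    return tuple(HIGHPOINTS.sub(u'', item[key]) for key in keys)
```

The call in the pipeline would then look something like `self.cursor.execute(self.sql, clean_params(item))`.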

Alternatively, you could use the .translate() function with a mapping table that contains only None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

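However, creating the translation table eats a lot of memory and takes some time to generate; it is probably not worth the effort, as the regular expression approach is more efficient.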
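If you still prefer .translate(), one way around the memory cost is to pass an object that computes the mapping on demand instead of a dict with over a million entries. This is a sketch rather than part of the approach above: `DropHighPoints` is a made-up name, and it relies on .translate() accepting any object with `__getitem__` and leaving a character untouched when the lookup raises LookupError.

```python
class DropHighPoints(object):
    """Lazy translation table: deletes code points >= U+10000, keeps the rest."""

    def __getitem__(self, codepoint):
        if codepoint >= 0x10000:
            return None  # None tells .translate() to delete the character
        raise LookupError(codepoint)  # unmapped: the character is kept as-is

example = u'Some example text with a sleepy face: \U0001f62a'
cleaned = example.translate(DropHighPoints())
```

This does one Python-level call per character, so the regular expression is still likely to be faster on long strings.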

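This all presumes you are using a UCS-4 compiled Python. If your Python was compiled with UCS-2 support, you can only use codepoints up to '\U0000ffff' in regular expressions; such a build stores higher codepoints as surrogate pairs, so the character class above does not work there as written.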
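A sketch of how the build could be detected at runtime: `sys.maxunicode` is 0xFFFF on a UCS-2 build and 0x10FFFF on a UCS-4 build (and on Python 3.3+). The function name `build_highpoint_pattern` is hypothetical, and matching surrogate pairs on a UCS-2 build is an assumption about what you want there, not something stated above.

```python
import re
import sys

def build_highpoint_pattern():
    """Compile a pattern for characters outside the BMP on either build type."""
    if sys.maxunicode > 0xFFFF:
        # UCS-4 build (or Python 3.3+): one code point per character.
        return re.compile(u'[\U00010000-\U0010ffff]')
    # UCS-2 build: such characters are stored as a high + low surrogate pair.
    return re.compile(u'[\ud800-\udbff][\udc00-\udfff]')

highpoints = build_highpoint_pattern()
```

`highpoints.sub(u'', text)` can then be used the same way as in the regular expression example.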

I note that as of MySQL 5.5.3 the newly added utf8mb4 character set does support the full Unicode range.

score 2 · Accepted Answer

I think you should use the utf8mb4 character set instead of utf8 and run

SET NAMES UTF8MB4

after connecting to the DB (link, link, link)
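From Python, that statement could be issued right after the connection is opened. The sketch below assumes a DB-API 2.0 (PEP 249) connection object, such as the one the pipeline in the question presumably holds; `use_utf8mb4` is a hypothetical helper name. It only helps if the server is 5.5.3 or newer and the target column itself uses utf8mb4.

```python
def use_utf8mb4(connection):
    """Switch the session character set to utf8mb4 on a DB-API connection."""
    cursor = connection.cursor()
    try:
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()
```

Call it once per connection, before any INSERT that may carry 4-byte characters.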

score 0 · Accepted Answer

A simple normalization and translation of the string without a regex:

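def normalize_unicode(s):
    # Replace every code point at or above U+10000 with U+FFFD (the replacement
    # character); the original returned the int 0xfffd, which breaks join().
    return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)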
