python - Python, convert 4-byte characters to avoid MySQL error "Incorrect string value:"

Question

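I need (in Python) to convert 4-byte characters into some other character. This is so I can insert the text into my utf-8 MySQL database without getting an error such as: "Incorrect string value: '\xF0\x9F\x94\x8E' for column 'line' at row 1"

The question "Warning raised by inserting 4-byte unicode to mysql" shows doing it this way: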

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

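However, I get the same error as the user in the comments, "...bad character range..". This is apparently because my Python is a UCS-2 (not UCS-4) build. But then I am not clear on what to do instead?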

score 15 · Accepted Answer

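In a UCS-2 build, Python internally uses 2 code units for each Unicode character above the \U0000ffff code point. Regular expressions have to work with those code units, so you need the following regular expression to match them: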

highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

This regular expression matches any code point encoded with a UTF-16 surrogate pair (see UTF-16 code points U+10000 to U+10FFFF).
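To make the surrogate-pair ranges in that pattern concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into a high and a low surrogate. The helper name `to_surrogate_pair` is made up for illustration and is not part of Python or of the answer.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)        # low 10 bits -> U+DC00..U+DFFF
    return high, low


# The sleepy face U+1F62A is stored as the pair U+D83D, U+DE2A.
high, low = to_surrogate_pair(0x1F62A)
print(hex(high), hex(low))  # 0xd83d 0xde2a
```

The high value always lands in the first character class of the pattern and the low value in the second, which is why the two-class regex catches every such character on a UCS-2 build.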

To make this compatible across Python UCS-2 and UCS-4 builds, you could use a try:/except to use one or the other:

try:
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
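As a possible alternative to catching `re.error`, the build type can be checked up front with `sys.maxunicode`, which is 0xFFFF on a UCS-2 build and 0x10FFFF otherwise. This is only a sketch of that variation, not the approach taken in the answer itself.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # UCS-4 (wide) build: each astral character is a single code point.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # UCS-2 (narrow) build: astral characters appear as surrogate pairs.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

stripped = highpoints.sub(u'', u'Some example text with a sleepy face: \U0001f62a')
```

Either way of choosing the pattern should give the same `highpoints` object for a given interpreter; the explicit check simply avoids relying on an exception for control flow.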

Demo on a UCS-2 Python build:

>>> import re
>>> highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
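Since the question asks about converting these characters rather than only deleting them, the same pattern can substitute a placeholder instead of an empty string. The sketch below wraps this in a function; the name `replace_astral` and the default replacement U+FFFD are assumptions chosen for illustration.

```python
import re

try:
    _HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build: match surrogate pairs instead.
    _HIGHPOINTS = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def replace_astral(text, replacement=u'\ufffd'):
    """Replace every character above U+FFFF with `replacement`.

    Hypothetical helper; the default is the Unicode replacement character.
    """
    return _HIGHPOINTS.sub(replacement, text)


cleaned = replace_astral(u'Some example text with a sleepy face: \U0001f62a')
```

Characters at or below U+FFFF need at most 3 bytes in UTF-8, so the returned text contains no 4-byte sequences.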
