You need to cut to a byte length, so you first need to .encode('utf-8') your string and then cut the result at a code point boundary.
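In UTF-8, ASCII (<= 127) takes 1 byte. A byte with its two or more most significant bits set (>= 192) is a character start byte; the number of most significant bits set determines how many bytes the character takes in total. Anything else is a continuation byte.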
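As a small illustration of those three kinds of byte, here is a sketch in Python 3. The helper name `classify_utf8_byte` is made up for this example; it only looks at the top bits, as described above, and does not validate whole sequences.

```python
def classify_utf8_byte(byte):
    """Return 'ascii', 'lead', or 'continuation' for one byte value (0-255)."""
    if byte <= 0x7F:
        return 'ascii'         # 0xxxxxxx: a complete 1-byte character
    if byte >= 0xC0:
        return 'lead'          # 11xxxxxx: starts a multi-byte character
    return 'continuation'      # 10xxxxxx: belongs to the preceding lead byte

# In Python 3, iterating over a bytes object yields integers.
# u"\u0647" is the Arabic letter heh, which encodes to two bytes.
for byte in u"a\u0647".encode('UTF-8'):
    print('0x%02X' % byte, classify_utf8_byte(byte))
```

A safe cut position is any index whose byte is not a continuation byte.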
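Cutting in the middle of a multi-byte sequence produces invalid UTF-8; if a character does not fit within the limit, it should be dropped completely, start byte included.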
Here's some working code:
# (mask, expected value after masking, total length in bytes) per start byte.
# RFC 3629 limits UTF-8 to 4 bytes, so the obsolete 5- and 6-byte forms are omitted.
LENGTH_BY_PREFIX = [
    (0xE0, 0xC0, 2),  # 110xxxxx
    (0xF0, 0xE0, 3),  # 1110xxxx
    (0xF8, 0xF0, 4),  # 11110xxx
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1  # ASCII
    for mask, value, length in LENGTH_BY_PREFIX:
        if first_byte & mask == value:
            return length
    raise ValueError('Invalid start byte %r' % first_byte)

def cut_to_bytes_length(unicode_text, byte_limit):
    # bytearray indexing yields integers on both Python 2 and Python 3
    utf8_bytes = bytearray(unicode_text.encode('UTF-8'))
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(utf8_bytes[cut_index])
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return bytes(utf8_bytes[:cut_index])
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return bytes(utf8_bytes)
Now test. If .decode() succeeds, the cut was made at a valid boundary.
unicode_text = u"هيك بنكون"  # note that the literal here is Unicode

print(cut_to_bytes_length(unicode_text, 100).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 10).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 5).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 4).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 3).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 2).decode('UTF-8'))

# This prints an empty string, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print(cut_to_bytes_length(unicode_text, 1).decode('UTF-8'))
You can check that the code works with ASCII as well.
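One way to run such a check as a self-contained script, assuming Python 3. It does not reuse the functions from this answer. Instead it uses `truncate_utf8`, a hypothetical helper that applies the same rule from the other end: slice at the limit, then back up over continuation bytes. For valid UTF-8 input the result should be the same.

```python
def truncate_utf8(text, byte_limit):
    """Encode text as UTF-8 and cut it to at most byte_limit bytes
    without splitting a character (truncate_utf8 is a hypothetical name)."""
    data = text.encode('UTF-8')
    if len(data) <= byte_limit:
        return data  # everything fits, nothing to cut
    cut = max(byte_limit, 0)
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character,
    # so back up to that character's start byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

# Pure ASCII: every byte is a valid boundary, so the cut is exact.
assert truncate_utf8(u"hello world", 5) == b"hello"
assert truncate_utf8(u"hello", 100) == b"hello"

# Mixed text: "a", a 2-byte Arabic letter, then "b".
assert truncate_utf8(u"a\u0647b", 2) == b"a"  # the 2-byte letter does not fit
assert truncate_utf8(u"a\u0647b", 3).decode('UTF-8') == u"a\u0647"
```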