I have a function here that truncates a given string to a given length in bytes:

LENGTH_BY_PREFIX = [
  (0xC0, 2), # first byte mask, total codepoint length
  (0xE0, 3), 
  (0xF0, 4),
  (0xF8, 5),
  (0xFC, 6),
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1 # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_string_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes

This seemed to work fine until emoji came into the picture:

string = u"\ud83d\ude14"
trunc = cut_string_to_bytes_length(string, 100)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 5, in cut_string_to_bytes_length
  File "<console>", line 7, in codepoint_length
AssertionError: Invalid byte 152

Can anyone explain exactly what is happening here, and what a possible solution would be?

Edit: I have another code snippet here that does not raise an exception, but sometimes behaves strangely:

import encodings
_incr_encoder = encodings.search_function('utf8').incrementalencoder()

def utf8_byte_truncate(text, max_bytes):
    """ truncate utf-8 text string to no more than max_bytes long """
    byte_len = 0
    _incr_encoder.reset()
    for index,ch in enumerate(text):
        byte_len += len(_incr_encoder.encode(ch))
        if byte_len > max_bytes:
            break
    else:
        return text
    return text[:index]

>>> string = u"\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14" 
>>> print string
(prints a set of 5 Apple Emoji...)
>>> len(string)
10
>>> trunc = utf8_byte_truncate(string, 4)
>>> print trunc
???
>>> len(trunc)
1

So in the second example I have a 10-byte string, I truncate it to 4, and something strange happens: the result is a string 1 byte in size.

3 Answers

As @jwpat7 pointed out, the algorithm is wrong. A simpler algorithm is:

# s = u'\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14'
# Same as above
s = u'\U0001f614' * 5   # Unicode character U+1F614

def utf8_lead_byte(b):
    '''A UTF-8 intermediate byte starts with the bits 10xxxxxx.'''
    return (ord(b) & 0xC0) != 0x80

def utf8_byte_truncate(text, max_bytes):
    '''If text[max_bytes] is not a lead byte, back up until a lead byte is
    found and truncate before that character.'''
    utf8 = text.encode('utf8')
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    while i > 0 and not utf8_lead_byte(utf8[i]):
        i -= 1
    return utf8[:i]

# test for various max_bytes:
for m in range(len(s.encode('utf8'))+1):
    b = utf8_byte_truncate(s,m)
    print m,len(b),b.decode('utf8')

Output:

0 0 
1 0 
2 0 
3 0 
4 4 
5 4 
6 4 
7 4 
8 8 
9 8 
10 8 
11 8 
12 12 
13 12 
14 12 
15 12 
16 16 
17 16 
18 16 
19 16 
20 20 
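
To make the lead-byte test concrete, here is a small Python 3 sketch (not part of the code above) that classifies the four UTF-8 bytes of U+1F614. The name `is_lead_byte` is only illustrative; it mirrors `utf8_lead_byte` but takes an integer, since indexing a `bytes` object in Python 3 yields integers.

```python
# Sketch (Python 3): which bytes of an encoded emoji are lead bytes?
def is_lead_byte(b):
    # Continuation bytes have the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    return (b & 0xC0) != 0x80

encoded = '\U0001f614'.encode('utf-8')        # four bytes: F0 9F 98 94
flags = [is_lead_byte(b) for b in encoded]

print([hex(b) for b in encoded])
print(flags)
```

Only the first byte is a lead byte, so a cut that lands anywhere inside the character has to back up to that byte, which is why the lengths in the output above only change in steps of 4.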
Answered 2012-12-06T06:56:03.020

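If a number f satisfies f & 0xF0 == 0xF0, then f & 0xC0 == 0xC0 holds as well, because 0xF0 has all the bits that 0xC0 has, and then some. That is, among other problems, your codepoint_length() function returns a step of 2 when it should return 4. If you reverse the LENGTH_BY_PREFIX list, the function works for the first example.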

LENGTH_BY_PREFIX = [
  (0xFC, 6),
  (0xF8, 5),
  (0xF0, 4),
  (0xE0, 3), 
  (0xC0, 2), # first byte mask, total codepoint length
]
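
The effect of the ordering can be checked with a short Python 3 sketch. It is an illustration, not the original function: `codepoint_length` here takes the table as a second parameter (an assumption made so both orderings can be compared), and it raises `ValueError` instead of using `assert`.

```python
# Sketch (Python 3): compare the original and reversed mask tables.
ORIGINAL_ORDER = [(0xC0, 2), (0xE0, 3), (0xF0, 4), (0xF8, 5), (0xFC, 6)]
REVERSED_ORDER = list(reversed(ORIGINAL_ORDER))

def codepoint_length(first_byte, table):
    if first_byte < 128:
        return 1  # ASCII
    for mask, length in table:
        # The first mask whose bits are all set wins, so order matters.
        if first_byte & mask == mask:
            return length
    raise ValueError('Invalid byte %r' % first_byte)

encoded = '\U0001f614'.encode('utf-8')   # four bytes: F0 9F 98 94
lead = encoded[0]                        # 0xF0

original_step = codepoint_length(lead, ORIGINAL_ORDER)
reversed_step = codepoint_length(lead, REVERSED_ORDER)
byte_after_wrong_step = encoded[original_step]

print(original_step, reversed_step, byte_after_wrong_step)
```

With the original order the step is 2, so the next "first byte" examined is the continuation byte 0x98, which is the 152 reported in the AssertionError. With the reversed order the step is 4.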
Answered 2012-12-05T16:58:15.213

A Python 3 version of Mark's code:

# s = u'\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14'
# Same as above
s = u'\U0001f614' * 5   # Unicode character U+1F614

def utf8_lead_byte(b):
    '''A UTF-8 intermediate byte starts with the bits 10xxxxxx.'''
    return (b & 0xC0) != 0x80

def utf8_byte_truncate(text, max_bytes):
    '''If text[max_bytes] is not a lead byte, back up until a lead byte is
    found and truncate before that character.'''
    utf8 = text.encode('utf8')
    if len(utf8) <= max_bytes:
        return utf8
    i = max_bytes
    while i > 0 and not utf8_lead_byte(utf8[i]):
        i -= 1
    return utf8[:i]

# test for various max_bytes:
for m in range(len(s.encode('utf8'))+1):
    b = utf8_byte_truncate(s,m)
    print(m,len(b),b.decode('utf8'))

Edit: This is Mark Tolonen's original code adapted for Python 3. The previous code was wrong. Thanks for the comments!
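
If a `str` result is wanted rather than `bytes`, one possible alternative in Python 3 is to slice the encoded bytes and let the decoder drop the incomplete trailing sequence. This is a different technique from the lead-byte scan above, shown only as a sketch; the function name `truncate_utf8_str` is hypothetical.

```python
# Sketch (Python 3): truncate by bytes, then discard any partial character.
def truncate_utf8_str(text, max_bytes):
    # Slicing valid UTF-8 can leave at most one incomplete sequence at the
    # end; errors='ignore' drops those leftover bytes while decoding.
    return text.encode('utf-8')[:max_bytes].decode('utf-8', errors='ignore')

s = '\U0001f614' * 5   # 20 bytes in UTF-8

for m in (3, 4, 7, 100):
    result = truncate_utf8_str(s, m)
    print(m, len(result.encode('utf-8')), result)
```

The trade-off is that `errors='ignore'` would also silently drop invalid bytes elsewhere in the data, which cannot happen here because the bytes come straight from `encode`.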

Answered 2017-05-08T13:25:37.457