python - Reading a UTF-8 string from a binary file

Question

I have some files which contain a bunch of different kinds of binary data, and I'm writing a module to deal with these files.

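Among other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()), followed by the string. Since it is UTF-8, the length of the string in bytes may be greater than stringLength, so doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).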

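How do I read n UTF-8 characters (as distinct from n bytes) from a file, taking the multi-byte nature of UTF-8 into account? I've been googling for half an hour, and all the results I've found are either irrelevant or make assumptions that I cannot make.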
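To make the mismatch concrete, here is a minimal sketch of the record layout described above, written for Python 3. The sample text and the variable names are made up for illustration; only the 2-byte big-endian length prefix counting characters comes from the question.

```python
import struct

# Hypothetical sample string: 10 characters, two of which need 2 bytes in UTF-8.
text = 'naïve café'
encoded = text.encode('utf-8')

# Record layout from the question: 2-byte big-endian character count, then the bytes.
record = struct.pack('>H', len(text)) + encoded

(string_length,) = struct.unpack('>H', record[:2])
print(string_length)   # number of characters stored in the header
print(len(encoded))    # number of bytes actually occupied by the string

# Treating the character count as a byte count cuts the string short.
short = record[2:2 + string_length].decode('utf-8')
print(short)
```

Here the header says 10 while the string occupies 12 bytes, so a plain read(stringLength) stops two bytes early and leaves the file position in the middle of the record.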

score 6 · Accepted Answer

Given a file object and a number of characters, you can use:

# Build a table mapping each lead byte to its expected follow-byte count.
# Bytes 00-BF have 0 follow bytes; F5-FF are not legal UTF-8.
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# Leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)

def readUTF8(f, count):
    """Read `count` UTF-8 characters from binary file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        if not lead:
            break  # end of file: return what was read so far
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    # b''.join works with Python 2 str and Python 3 bytes alike
    return b''.join(res).decode('utf8')

Test:

>>> from io import BytesIO
>>> test = BytesIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'

In Python 3, it is of course much easier to just wrap the file object in an io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.
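A minimal sketch of that Python 3 approach follows. The helper name read_chars is hypothetical, and the choices of newline='' (to switch off newline translation, which would otherwise change character counts) and detach() (so the wrapper does not close the underlying file) are assumptions about what a binary-format reader would want.

```python
import io

def read_chars(binary_file, count):
    """Read `count` characters from a binary file object via TextIOWrapper."""
    # newline='' disables newline translation so '\r\n' stays two characters.
    text = io.TextIOWrapper(binary_file, encoding='utf-8', newline='')
    try:
        return text.read(count)
    finally:
        # Release the underlying file so it is not closed with the wrapper.
        text.detach()

data = 'This is a test containing Unicode data: \ua000'.encode('utf-8')
print(read_chars(io.BytesIO(data), 41))
```

One caveat: TextIOWrapper reads ahead from the underlying file in chunks, so after this call the binary file position is not guaranteed to sit right after the string. When the string is followed by other binary fields, reading byte by byte as in readUTF8 above may be the safer choice.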

score 0 · Answer

A character in UTF-8 can be 1, 2, 3, or 4 bytes long.

If you have to read the file byte by byte, you have to follow the UTF-8 encoding rules: http://en.wikipedia.org/wiki/UTF-8
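As a sketch of byte-by-byte reading in Python 3 that leaves the encoding rules to the standard library, one option is an incremental decoder from the codecs module. The function name read_utf8_chars and the decision to raise EOFError on a short file are assumptions for illustration, not part of the original answer.

```python
import codecs
import io

def read_utf8_chars(f, count):
    """Read exactly `count` UTF-8 characters from binary file `f`."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    text = ''
    while len(text) < count:
        byte = f.read(1)
        if not byte:
            raise EOFError('file ended before %d characters were read' % count)
        # Returns '' until a full character is available; raises
        # UnicodeDecodeError if the bytes are not valid UTF-8.
        text += decoder.decode(byte)
    return text

# Hypothetical record: 7 characters of 1 to 4 bytes each, then unrelated bytes.
f = io.BytesIO('héllo€😀'.encode('utf-8') + b'!\x00\x01')
print(read_utf8_chars(f, 7))
print(f.read())  # the bytes after the string are still unread
```

Because only one byte is requested at a time, the file position ends up immediately after the last byte of the final character, so the rest of the binary data can be parsed as usual.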

Most of the time, you can simply set the encoding to utf-8 and read from the input stream.

You then don't need to care how many bytes you have read.
