unicode - UnicodeDecodeError: unexpected end of data

Question

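I have a huge text file that I want to open.
I am reading the file in chunks to avoid the memory problems that come with reading too much of the file at once.

Code snippet: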

def open_delimited(fileName, args):

    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1]) 
        if remainder:
            yield remainder

The code raises the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.

I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.

latin-1 and iso-8859-1 both raised the error IndexError: list index out of range

Sample of the input file:

b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'

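I should also mention that I have several of these huge text files.
UTF16 works fine for many of them and fails on one specific file.

Is there any way to resolve this issue?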

score 9 · Accepted Answer

To ignore corrupt data (which can lead to data loss), set errors='ignore' on the open() call:

with open(fileName, args, encoding="UTF16", errors='ignore') as infile:

The open() function documentation states:

'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
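As a rough illustration of what that setting does, here is a minimal sketch. It assumes a hypothetical temporary file containing a BOM, the text "abc" in UTF-16, and one stray trailing byte; the file and its contents are made up for the example and are not taken from the question.

```python
import os
import tempfile

# Hypothetical sample: a BOM, "abc" as UTF-16-LE, then one stray byte
# that cannot form a complete 2-byte code unit.
payload = b"\xff\xfe" + "abc".encode("utf-16-le") + b"\x31"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Strict decoding (the default) fails on the incomplete final code unit.
    try:
        with open(path, "r", encoding="UTF16") as infile:
            infile.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors='ignore' silently drops the bytes that cannot be decoded.
    with open(path, "r", encoding="UTF16", errors="ignore") as infile:
        ignored = infile.read()

    # errors='replace' marks the damage with U+FFFD instead of hiding it.
    with open(path, "r", encoding="UTF16", errors="replace") as infile:
        replaced = infile.read()
finally:
    os.remove(path)

print(strict_failed, repr(ignored), repr(replaced))
```

If you would rather see where the damage is than have it vanish, errors='replace' may be the safer choice of the two.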

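This does not mean you can recover from the apparent data corruption you are experiencing.

To illustrate, imagine that a byte was removed or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per code unit (most characters take one unit, characters outside the Basic Multilingual Plane take two). If one byte is missing or surplus, then every byte pair that follows it will be out of alignment.

This can lead to decoding problems further down the line, not necessarily immediately. Some UTF-16 code units are illegal on their own, usually because they are only valid in combination with another byte pair (a surrogate pair); your exception was raised for such an invalid sequence. But hundreds or thousands of byte pairs before that point may have been valid UTF-16, even if they were not legible text.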
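The misalignment effect can be reproduced in a few lines. This is only a sketch: the sample string is hypothetical (loosely modeled on the input shown in the question), and dropping the byte at index 5 is an arbitrary choice to simulate corruption.

```python
# Hypothetical sample line, loosely modeled on the input in the question.
text = "10059\tPost\tHappy Birthday"
data = text.encode("utf-16-le")

# Simulate corruption: drop a single byte (index 5) from the stream.
damaged = data[:5] + data[6:]

# The first two characters still decode correctly; every byte pair after
# the damage is shifted by one byte and decodes to an unrelated character.
shifted = damaged.decode("utf-16-le", errors="replace")

# In this sample the shifted pairs all happen to be valid code units, so the
# strict decoder only fails on the odd byte left over at the very end.
try:
    damaged.decode("utf-16-le")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(shifted))
print(strict_error)
```

The error is reported far from where the byte actually went missing, which is why the position in the exception message is not a reliable pointer to the corruption itself.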

score 3 · Accepted Answer

I was doing the same thing (reading many large text files in chunks) and ran into the same error with one of the files:

Traceback (most recent call last):
  File "wordcount.py", line 128, in <module>
    decodedtext = rawtext.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9999999: unexpected end of data

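Here is what I found: the problem was a particular Unicode sequence (\xc2\xa0\xc2\xa0) straddling two chunks. The sequence was therefore split and became undecodable. Here is how I solved it: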

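# read a chunk of raw bytes (the file must be opened in binary mode)
rawtext = file.read(chunksize)

# fix split end: keep reading until the chunk ends on a space, so that
# no multi-byte sequence (or word) is cut in half
if chunknumber < totalchunks:
    while not rawtext.endswith(b' '):
        nextbyte = file.read(1)
        if not nextbyte:  # end of file reached, stop extending
            break
        rawtext = rawtext + nextbyte

# decode text
decodedtext = rawtext.decode('utf8')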

This also solves the more general problem of words being cut in half when they straddle two chunks.
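A different technique from the one above, offered only as a sketch: the standard library's incremental decoders (codecs.getincrementaldecoder) buffer an incomplete multi-byte sequence at the end of one chunk and finish it with the next. The function name decode_chunks and its parameters are hypothetical, chosen for this example. Unlike the space-seeking approach, this does not keep words intact across chunk boundaries.

```python
import codecs
import io


def decode_chunks(binary_file, encoding="utf-8", chunksize=4096):
    """Yield decoded text from a binary file object, chunk by chunk.

    Hypothetical helper: the incremental decoder holds back any bytes that
    form an incomplete sequence until the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        raw = binary_file.read(chunksize)
        if not raw:
            break
        text = decoder.decode(raw)
        if text:
            yield text
    # Flush the decoder; this raises UnicodeDecodeError if the data really
    # does end in the middle of a sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Usage: a chunk size of 2 deliberately splits both \xc2\xa0 sequences.
sample = b"a\xc2\xa0\xc2\xa0b"
result = "".join(decode_chunks(io.BytesIO(sample), chunksize=2))
print(repr(result))
```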

score 0 · Accepted Answer

This can also happen in Python 3 when you read from or write to an io.StringIO object instead of an io.BytesIO object.

answered 2018-07-18T21:03:56.043
