python - UnicodeDecodeError in Python 2.7

Question

I am trying to read a UTF-8 encoded XML file in Python, and I am doing some processing on the lines read from the file, as follows:

next_sent_separator_index =  doc_content.find(word_value, int(characterOffsetEnd_value) + 1)

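Here doc_content is the line read from the file, and word_value is one of the strings taken from that same line. Whenever doc_content or word_value contains non-ASCII characters, the line above raises an encoding-related error. So I tried decoding both of them as UTF-8 first (instead of relying on the default ascii codec), like this: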

next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)

But I still get a UnicodeDecodeError, as follows:

Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)

Can anyone suggest a suitable approach to avoid this kind of encoding error in Python 2.7?

score 5 · Accepted Answer

codecs.utf_8_decode(input.encode('utf8'))
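The traceback reports a UnicodeEncodeError raised from inside decode, which suggests that at least one of the two values is already a unicode object: in Python 2, calling .decode('utf-8') on a unicode object first encodes it implicitly with the ascii codec, and that step fails on u'\xe9'. Note also that codecs.utf_8_decode returns a (text, length) tuple rather than the text alone. A minimal sketch of an alternative, assuming the values may arrive either as byte strings or as unicode: decode only when the value is a byte string, or let io.open decode the file at read time. The helper name to_text and the sample line are hypothetical, not taken from the question's script.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def to_text(value, encoding='utf-8'):
    """Decode byte strings; return text (unicode) values unchanged."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value


# Hypothetical sample standing in for one line of the XML file.
raw_line = b'un caf\xc3\xa9 noir, un caf\xc3\xa9 cr\xc3\xa8me'

doc_content = to_text(raw_line)    # byte string, so it is decoded
word_value = to_text(u'caf\xe9')   # already text, so it is left alone
next_index = doc_content.find(word_value, 4)

# Alternative: let io.open decode while reading, so every line is already text
# and no later .decode() call is needed.
fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
with io.open(path, 'wb') as handle:
    handle.write(raw_line)
with io.open(path, encoding='utf-8') as handle:
    decoded_line = handle.readline()
os.remove(path)
```

With either approach, doc_content and word_value are both text before find is called, so no implicit ascii conversion takes place.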

answered 2012-06-03T17:05:36.680
