I have a file that contains a unicode string: u"L'\xe9quipe le quotidien"

I have another file, exported from Windows and encoded in iso-8859-1, with the same string: "L'<E9>quipe le quotidien" (this is a copy/paste from less in my shell).

Converting the contents of the Windows file with decode('iso-8859-1').encode('utf8') yields a string that differs from the one in the first file: L'Ã©quipe le quotidien.

What is the best way to do this comparison? I can't seem to convert the latin1 string to utf-8.

1 Answer

Your file is not encoded as Latin-1 (ISO-8859-1). You created a Mojibake instead; treating the result as a Unicode string, I had to encode it back to Latin-1 and then decode it as UTF-8 to recover the text:

>>> print u"L'Ã©quipe le quotidien.".encode('latin1').decode('utf8')
L'équipe le quotidien.
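
To show where this kind of Mojibake comes from, here is a minimal sketch of the full round trip. The variable names are illustrative only and not from the question, and the snippet is written to run on Python 3 as well as Python 2.

```python
# -*- coding: utf-8 -*-
# Sketch only: variable names are illustrative, not taken from the question.
original = u"L'\xe9quipe le quotidien"

# Correct UTF-8 bytes: the single code point U+00E9 becomes two bytes, C3 A9.
utf8_bytes = original.encode('utf8')

# Decoding those bytes with the wrong codec produces Mojibake:
# each byte is mapped to its own Latin-1 character (U+00C3 and U+00A9).
mojibake = utf8_bytes.decode('latin1')

# Undo the mistake: go back to the raw bytes, then decode with the right codec.
repaired = mojibake.encode('latin1').decode('utf8')
```

Here repaired compares equal to original, while mojibake is one character longer because the two UTF-8 bytes of é were each turned into a separate character.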

Generally speaking, you'd decode both files to unicode objects before comparing. Even then, you can still run into issues with Combining Diacritical Marks, where the letter é is actually represented with two codepoints, U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT.
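
A small sketch of both points, assuming the raw contents of the two files are already in memory. The byte strings below are hypothetical stand-ins, not the asker's actual data.

```python
# Hypothetical byte strings standing in for the raw contents of the two files.
latin1_bytes = b"L'\xe9quipe le quotidien"      # as stored in an ISO-8859-1 file
utf8_bytes = b"L'\xc3\xa9quipe le quotidien"    # as stored in a UTF-8 file

# The raw bytes differ, so comparing them directly fails.
bytes_equal = latin1_bytes == utf8_bytes

# Decoding each with its own codec gives identical text.
text_equal = latin1_bytes.decode('latin1') == utf8_bytes.decode('utf8')

# Pitfall: the precomposed letter and the letter plus combining accent
# look the same when rendered but are different code point sequences.
composed = u"\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = u"e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
forms_equal = composed == decomposed
```

Only text_equal is true here. forms_equal is false even though both strings display as é.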

You can work around that up to a point by normalising the text: pick either the decomposed (NFD) or the composed (NFC) form and normalise both strings to it with the unicodedata.normalize() function. See Normalizing Unicode for more details.
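
As a rough illustration, a comparison helper built on unicodedata.normalize() might look like the following. The helper name same_text and its default form are assumptions made for this sketch, not an established API.

```python
import unicodedata


def same_text(a, b, form='NFC'):
    """Compare two text strings after normalising both to the same form.

    same_text is a hypothetical helper; form may be 'NFC', 'NFD',
    'NFKC' or 'NFKD', the forms accepted by unicodedata.normalize().
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = u"L'\u00e9quipe le quotidien"      # precomposed U+00E9
decomposed = u"L'e\u0301quipe le quotidien"   # U+0065 followed by U+0301

plain_match = composed == decomposed                  # code points differ
normalised_match = same_text(composed, decomposed)    # equal once normalised
```

Which form you pick matters less than applying the same one to both sides. NFC and NFD leave compatibility characters such as ligatures alone, which is one reason this only works up to a point.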

Answered 2015-03-20T16:02:07.973