python - Writing unicode with Python - what is wrong with this character

Question

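Using Python 2.7, I read text as unicode and write it out as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly: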

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

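However, either of the following substitutions in the code above makes it work:

Using unichr(33033) or unichr(33035) instead, both of which are written fine.
Using the 'utf-8' encoding (with no BOM, byte order mark).

How can I identify the characters that will not be written correctly, and how can I write a 'utf-16-le' encoded file with a BOM that either outputs these characters or substitutes something for them?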

score 4 · Accepted Answer

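You are opening the file in text mode, which means that line-break characters/bytes are translated to the local convention. Unfortunately, the utf-16-le encoding of the character you are trying to write contains the byte 0A, which is interpreted as a line break, so it does not reach the file unchanged.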
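As a rough way to spot the affected characters, the sketch below checks whether the encoded form of a character contains the byte 0A. The helper name has_newline_byte is made up for illustration, and the check assumes that line-feed translation is the only change text mode makes on write.

```python
def has_newline_byte(ch, encoding='utf-16-le'):
    """Return True if the encoded form of ch contains the byte 0x0A."""
    # bytearray yields integers on both Python 2 and Python 3
    return 0x0A in bytearray(ch.encode(encoding))

# The three characters from the question: 33033, 33034 and 33035
candidates = [u'\u8109', u'\u810a', u'\u810b']
results = [has_newline_byte(ch) for ch in candidates]
print(results)

# A character whose high byte is 0A is affected in the same way
high_byte_case = has_newline_byte(u'\u0a41')
print(high_byte_case)
```

Under this check, any code point whose low or high byte is 0A qualifies in utf-16-le, which is why the neighbours of u'\u810a' are unaffected.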

Open the file in binary mode instead:

open('temp.txt','wb')
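A minimal sketch of the question's code with that one change applied, followed by a read-back of the raw bytes to confirm nothing was translated. Writing to a temporary directory is an assumption made only to keep the example self-contained.

```python
import codecs
import os
import tempfile

text = u'\u810a'
encoded = text.encode('utf-16-le')

# Hypothetical location; any writable path would do
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# 'wb' writes the bytes exactly as given, with no newline translation
with open(path, 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(encoded)

# Read the file back in binary mode and format it as a hex dump
with open(path, 'rb') as temp:
    raw = temp.read()
file_hex = ' '.join('%02X' % b for b in bytearray(raw))
print(file_hex)
```

With text mode on a platform that translates line endings, such as Windows, the 0A byte would typically be expanded to 0D 0A, which shifts every byte after it and corrupts the UTF-16 stream.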

score 1 · Accepted Answer

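@Joni's answer identifies the root of the problem, but if you use codecs.open instead, the file is always opened in binary mode, even if that is not specified. Using the utf16 codec also automatically writes a BOM, in native byte order: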

import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')

Hex dump of temp.txt:

FF FE 0A 81

Reference: codecs.open
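To check the result without a hex viewer, the file can be read back with the same codec. This is a small sketch under the assumption that a temporary path is acceptable. The BOM is consumed by the decoder on reading, so only the character itself comes back.

```python
import codecs
import os
import tempfile

# Hypothetical location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with codecs.open(path, 'w', 'utf16') as temp:
    temp.write(u'\u810a')

# The raw file starts with a BOM in the machine's native byte order
with open(path, 'rb') as f:
    raw = f.read()
starts_with_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf16' uses the BOM to pick the byte order and drops it
with codecs.open(path, 'r', 'utf16') as temp:
    round_tripped = temp.read()

print(starts_with_bom)
print(round_tripped == u'\u810a')
```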

score 0 · Accepted Answer

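You are already using the codecs library. When working with the file, you should replace open() with codecs.open() so that the encoding is handled transparently.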

import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))

If you still have problems after that, the issue is probably with the viewer rather than with your Python script.
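One point to keep in mind: the explicit-endian codec 'utf-16-le' does not write a BOM on its own, while the question asks for one. A possible variant, sketched here with a temporary path as an assumption, is to write U+FEFF as the first character so that it is encoded as the bytes FF FE.

```python
import codecs
import os
import tempfile

# Hypothetical location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with codecs.open(path, 'w', encoding='utf-16-le') as temp:
    # U+FEFF encoded as utf-16-le is exactly the little-endian BOM
    temp.write(u'\ufeff')
    temp.write(u'\u8109\u810a\u810b')

with open(path, 'rb') as f:
    raw = f.read()
file_hex = ' '.join('%02X' % b for b in bytearray(raw))
print(file_hex)
```

Viewers that rely on a BOM to detect UTF-16 may otherwise misread a file written without one.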
