python - Unescaping HTML entities and UTF-8 in Python

Question

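I'm parsing HTML files that contain many special characters, both as raw Unicode and as HTML entities. Despite reading a lot of documentation about Unicode in Python, I still can't get the HTML entities to convert correctly.

Here is the test I ran: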

>>> import HTMLParser
>>> p = HTMLParser.HTMLParser()
>>> s = p.unescape("&#139;")
>>> repr(s)
"u'\\x8b'"
>>> print s 
Â‹ # !!!
>>> s
u'\x8b'
>>> print s.encode("latin1")
‹ # OK, it prints fine in latin1, but I need UTF-8 ...
>>> print s.encode("utf8")
Â‹ # !!!

>>> import codecs
>>> out = codecs.open("out8.txt", encoding="utf8", mode="w")
>>> out.write(s)
# Viewing the file as ANSI gives me Â‹ # !!!
# Viewing the file as UTF8 gives NOTHING, as if the file were empty # !!!

What is the correct way to write the unescaped string s to a UTF-8 file?

score 3 · Accepted Answer

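U+008B is a control character, so it's no surprise that you see nothing. "‹" is U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK, which isn't even in Latin-1. It is, however, the character at 0x8B in CP1252. And stop relying on Windows console output to tell you what is right or wrong, unless you run chcp 65001 beforehand.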
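
To make the point above concrete, here is a minimal sketch of one possible workaround, not necessarily what the answerer had in mind. It assumes Python 3 (the question uses Python 2) and that the page author meant CP1252 when writing &#139;. The helper name fix_cp1252 is hypothetical; it reinterprets code points in the C1 control range as CP1252 bytes, then the result is written out as UTF-8 and checked by reading the raw bytes back rather than trusting the console.

```python
import os
import tempfile


def fix_cp1252(text):
    """Reinterpret C1 control characters (U+0080..U+009F) as CP1252 bytes.

    Hypothetical helper: assumes such characters came from numeric
    references like &#139; that were really meant as CP1252.
    """
    fixed = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range are undefined in CP1252;
                # leave those characters unchanged.
                pass
        fixed.append(ch)
    return "".join(fixed)


if __name__ == "__main__":
    s = fix_cp1252("\x8b")  # what the old unescape returned for &#139;
    path = os.path.join(tempfile.mkdtemp(), "out8.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write(s)
    # Inspect the raw bytes instead of relying on console rendering.
    with open(path, "rb") as raw:
        print(raw.read())  # b'\xe2\x80\xb9', the UTF-8 encoding of U+2039
```

On Python 3, html.unescape follows the HTML5 rules for invalid character references and already maps &#139; to U+2039, so a helper like this matters mainly for strings that were unescaped by older code.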
