python - Convert a unicode value with UTF-8 bytes as its content to str

Question

I'm using pyquery to parse a page:

from pyquery import PyQuery

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

But the content I get back is a unicode string whose code points are actually UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to a str without losing the content?

To be clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

score 27 · Accepted Answer

If you have a unicode value that contains UTF-8 bytes, encode it to Latin-1 to preserve the "bytes":

content = content.encode('latin1')

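This works because the Unicode code points U+0000 through U+00FF map one-to-one onto the byte values 0x00 through 0xFF in the Latin-1 encoding, so encoding with it turns each code point in your data into the literal byte of the same value.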
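
The one-to-one mapping can be checked directly. The following is a minimal sketch, written to run on both Python 2.7 and Python 3; the helper name latin1_byte_values is made up for this illustration and is not part of any library.

```python
# Check that Latin-1 maps code points U+0000..U+00FF to bytes 0x00..0xFF.
# Intended to run unchanged on Python 2.7 and Python 3.

def latin1_byte_values(text):
    """Encode text as Latin-1 and return the resulting byte values as ints."""
    # (hypothetical helper, named for this example only)
    return list(bytearray(text.encode('latin-1')))

# Build a text string holding every code point from U+0000 to U+00FF.
all_code_points = bytearray(range(256)).decode('latin-1')

code_point_values = [ord(ch) for ch in all_code_points]
byte_values = latin1_byte_values(all_code_points)

# Each code point encodes to the single byte with the same numeric value.
print(code_point_values == byte_values)  # prints True
```

Because no byte value is ever changed or dropped, the encode step is lossless for any string whose code points all lie at or below U+00FF.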

For your example, this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
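
If the same code path may also receive strings that were decoded correctly, the round trip can be wrapped so that it only applies when it succeeds. This is a sketch under that assumption; repair_utf8_mojibake is a hypothetical name, not a pyquery or standard-library function, and it runs on both Python 2.7 and Python 3.

```python
# Hypothetical helper: repair text only when it really looks like UTF-8
# bytes that were mis-decoded as Latin-1; otherwise return it unchanged.

def repair_utf8_mojibake(text):
    try:
        # Raises UnicodeEncodeError if any code point is above U+00FF.
        raw = text.encode('latin-1')
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

broken = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
already_fine = u'\u5c42\u53e0\u6837\u5f0f\u8868'

repaired = repair_utf8_mojibake(broken)          # decoded to the real text
untouched = repair_utf8_mojibake(already_fine)   # returned as-is
```

One caveat: genuine Latin-1 text whose bytes happen to form valid UTF-8 would also be rewritten, so a guard like this is a heuristic rather than a guarantee.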

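PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests it uses the .text attribute of the response. That attribute auto-decodes the response data based on the encoding set in the Content-Type header alone, or, if that information is not available, uses latin-1 (this applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument: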

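# The keyword argument must come after the positional data dict;
# placing encoding= before it is a SyntaxError.
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')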

At that point you don't have to re-encode at all.
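
The Latin-1 fallback described above can be reproduced without any network access. The sketch below only simulates the decoding step on the bytes from the question; it does not call requests or PyQuery, and the variable names are chosen for illustration. It runs on both Python 2.7 and Python 3.

```python
# Simulate a UTF-8 response body being decoded with a Latin-1 fallback,
# then compare that with decoding it using the correct codec.

# The UTF-8 bytes from the question, as a server would send them.
raw_body = b'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# Fallback behaviour: every byte becomes one code point (15 characters).
mojibake = raw_body.decode('latin-1')

# Encoding back to Latin-1 recovers the original bytes exactly.
recovered_bytes = mojibake.encode('latin-1')

# Decoding with the right codec from the start yields the 5 intended
# characters, which is the effect of declaring the encoding up front.
text = raw_body.decode('utf-8')
```

Here mojibake has the same value as the unicode string in the question, and recovered_bytes is the str the question asks for.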
