python - How to handle encodings with the Python Requests library

Question

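I've struggled with encodings for too long, and today I want to break through the mental block.

Right now I'm scraping a bunch of websites with Requests. As far as I know, it uses the HTTP headers to determine the encoding a page uses, falling back to chardet when a site's headers are missing. From there, it decodes the bytes it downloaded and helpfully hands me the result as r.text.

All good.

What confuses me is that I then do some work on the text and print it to stdout, supplying the encoding when I print: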

 print foo.encode('utf-8')

The problem is that when I do this, what gets printed is messed up. In the following, I expect an em dash between the words "judgments" and "Standard":

 Declaratory judgmentsStandard of review.

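Instead, I get a boxy character with four little digits in it. It doesn't seem to show up here, of course, but I think the digits are 0097, which matches what I get from: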

repr(foo)
u'Declaratory judgments\x97Standard of review.'

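So that makes sense, but where is my em dash?

The process boils down to:

Requests downloads the page and intelligently decodes the text into a unicode object
I work with it
I encode it as utf-8 and print it.

Where is the problem? This sounds to me like the mythical Unicode sandwich, but clearly I'm missing something.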

score 4 · Accepted Answer

You are doing something odd. \x97 is an emdash in the cp1252 encoding. In a Unicode string, it's U+0097 END OF GUARDED AREA. Somehow, you are reading cp1252 bytes as Unicode. Show more of the code that got you to this state, and we can dig deeper.
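
To illustrate the mismatch, here is a minimal sketch in Python 3 syntax (the question uses Python 2). It assumes the server sent cp1252 bytes and that they were decoded as ISO-8859-1, which is the charset Requests assumes for text/* responses whose Content-Type header names none. The question does not show the headers, so that cause is only a guess, and the sample bytes are reconstructed from the repr in the question.

```python
# Assumed raw response bytes: in cp1252, byte 0x97 is an em dash.
raw = b"Declaratory judgments\x97Standard of review."

# Decoding as ISO-8859-1 maps byte 0x97 straight to the control character U+0097.
wrong = raw.decode("iso-8859-1")

# Decoding as cp1252 yields the intended U+2014 EM DASH.
right = raw.decode("cp1252")

# A string that was already decoded wrongly can be repaired, because
# ISO-8859-1 maps every byte to the code point of the same value and back.
repaired = wrong.encode("iso-8859-1").decode("cp1252")

print(repr(wrong))
print(repr(right))
print(repaired == right)
```

With Requests itself, the equivalent move would be to set `r.encoding = 'cp1252'` before reading `r.text`, or to decode `r.content` yourself. Whether cp1252 is the right choice for every site being scraped is something to check per response rather than assume.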

PS: the Unicode sandwich is hardly mythical, it is an ideal to strive for! :)
