python - Problem with content from urllib2.urlopen

Question

I have some simple Python code that makes a request to a server:

import urllib2

html_page = urllib2.urlopen(baseurl, timeout=20)
print html_page.read()
html_page.close()

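The problem shows up when I scrape a page that contains a "-" (dash) character. It is a dash in the browser, but when I print the response from urlopen, it prints as "?". I tried to reproduce the problem with a local HTML file, copying the affected text from the page source, but I could not reproduce it.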

What other factors or variables could be at play? Could this have something to do with encoding?

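Update: I now know that the problem is encoding-related. The site I am scraping is encoded in "iso-8859-1". The problem is that I still cannot decode it correctly, even after following "Python: Converting from ISO-8859-1/latin1 to UTF-8".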

Decoding the text gives me:

>>> text.decode("iso-8859-1")
  u"</strong><p>Let's\x97in "
>>> text.decode("iso-8859-1").encode("utf8")
  "</strong><p>Let's\xc2\x97in "
>>> print text.decode("iso-8859-1").encode("utf8")
  </strong><p>Let'sin

The character disappears completely. Does anyone have any ideas?

score 1 · Accepted Answer

Thanks to Adam Rosenfield, I figured out my problem. The site declares its character set as iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

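But the character I was having trouble with is an em dash encoded in Windows-1252. In ISO-8859-1 the byte 0x97 is an unprintable control character, which is why it vanished when printed; in Windows-1252 the same byte is the em dash: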

>>> text.decode("windows-1252")
  u"</strong><p>Let's\u2014in"
>>> print text.decode("windows-1252")
  </strong><p>Let's—in
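
For anyone hitting the same thing, here is a minimal sketch of the difference between the two codecs, plus one possible way to guard against it. The sample bytes are taken from the question; the helper name decode_html is hypothetical and is not part of urllib2. It assumes that a page declaring ISO-8859-1 should be decoded as Windows-1252, which is what web browsers do in practice.

```python
import codecs

# Sample bytes from the question: 0x97 sits where the dash should be.
raw = b"</strong><p>Let's\x97in "

# ISO-8859-1 maps 0x97 to U+0097, an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps 0x97 to U+2014, the em dash.
as_cp1252 = raw.decode("windows-1252")


def decode_html(data, declared_charset):
    """Decode page bytes, treating a declared ISO-8859-1 as Windows-1252.

    Hypothetical helper for illustration only.
    """
    codec = codecs.lookup(declared_charset)
    if codec.name == codecs.lookup("iso-8859-1").name:
        codec = codecs.lookup("windows-1252")
    # Windows-1252 leaves a few byte values undefined; substitute
    # U+FFFD for those instead of raising UnicodeDecodeError.
    return data.decode(codec.name, "replace")
```

With the code from the question, the data argument would be the result of html_page.read() and declared_charset the value from the page's meta tag. Any charset other than ISO-8859-1 is passed through to the matching codec unchanged.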

Thanks, everyone!
