python - Python urllib, minidom, and parsing international characters

Question

When I try to retrieve information from the Google weather API with the following URL,

http://www.google.com/ig/api?weather=Munich,Germany&hl=de

and then try to parse it with minidom, I get an error saying the document is not well-formed.

I am using the following code:

sock = urllib.urlopen(url) # above mentioned url
doc = minidom.parse(sock)

I think the German characters in the response are the cause of the error.

What is the correct way to do this?

score 2 · Accepted Answer

This seems to work:

sock = urllib.urlopen(url)
# There is a nicer way for this, but I don't remember right now:
encoding = sock.headers['Content-type'].split('charset=')[1]
data = sock.read()
dom = minidom.parseString(data.decode(encoding).encode('ascii', 'xmlcharrefreplace'))

I'm guessing minidom doesn't handle anything non-ASCII. You might want to look at lxml, which does.
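To make the `xmlcharrefreplace` step above concrete, here is a minimal sketch of what it does to non-ASCII text. The sample element is a hypothetical stand-in for the decoded response body, not the real API output. Escapes are used in the literal so the snippet should run on both Python 2 and Python 3.

```python
from xml.dom import minidom

# Hypothetical sample standing in for the decoded response body.
text = u'<city data="M\xfcnchen"/>'

# Characters outside ASCII are replaced by numeric character references.
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')
print(ascii_xml)

# The XML parser expands the references back into the original characters.
doc = minidom.parseString(ascii_xml)
city = doc.documentElement.getAttribute('data')
print(city == u'M\xfcnchen')
```

Because the bytes handed to the parser are pure ASCII, the result no longer depends on which encoding the parser assumes.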

score 1 · Accepted Answer

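According to Python's urllib.urlopen, the encoding sent in the header is iso-8859-1 (although in this case Firefox's Live HTTP Headers seems to disagree with me and reports utf-8). The XML itself does not specify an encoding, which is why xml.dom.minidom assumes it is utf-8.

So the following should fix this particular problem: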
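The mismatch described above can be reproduced without the network. The sketch below uses a hypothetical payload (not the real API response): ISO-8859-1 bytes with no XML declaration. The parser treats them as UTF-8 and rejects them, while the same bytes re-encoded as UTF-8 parse fine.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical payload: ISO-8859-1 bytes, no XML declaration.
raw = u'<city data="M\xfcnchen"/>'.encode('iso-8859-1')

try:
    minidom.parseString(raw)
    parsed_raw = True
except ExpatError:
    # The byte 0xFC is not valid UTF-8 here, so expat reports
    # that the document is not well-formed.
    parsed_raw = False

# Re-encoding as UTF-8 matches the parser's default assumption.
doc = minidom.parseString(raw.decode('iso-8859-1').encode('utf-8'))
city = doc.documentElement.getAttribute('data')
print(parsed_raw, city == u'M\xfcnchen')
```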

import urllib
from xml.dom import minidom

sock = urllib.urlopen('http://www.google.com/ig/api?weather=Munich,Germany&hl=de')
s = sock.read()
encoding = sock.headers['Content-type'].split('charset=')[1] # iso-8859-1
doc = minidom.parseString(s.decode(encoding).encode('utf-8'))

Edit: I updated this answer after Glenn Maynard's comment. I took the liberty of borrowing one line from Lennert Regebro's answer.
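Splitting the header on 'charset=' raises an IndexError when the server sends no charset. As a rough sketch for Python 3 (an assumption, since the code above targets Python 2), the response headers from `urllib.request.urlopen` are an `email.message.Message` subclass, so `get_content_charset()` can read the parameter instead. The helper name `parse_xml_response`, the `fallback` parameter, and the sample header and body are all hypothetical.

```python
from email.message import Message
from xml.dom import minidom


def parse_xml_response(body, headers, fallback='utf-8'):
    """Decode body with the charset named in headers, then parse it."""
    # Returns the lowercased charset parameter of Content-Type,
    # or None when the header or the parameter is missing.
    encoding = headers.get_content_charset() or fallback
    return minidom.parseString(body.decode(encoding).encode('utf-8'))


# Stand-in for response.headers; a real call would look like:
#   response = urllib.request.urlopen(url)
#   doc = parse_xml_response(response.read(), response.headers)
headers = Message()
headers['Content-Type'] = 'text/xml; charset=ISO-8859-1'
body = '<city data="M\xfcnchen"/>'.encode('iso-8859-1')

doc = parse_xml_response(body, headers)
city = doc.documentElement.getAttribute('data')
print(city == 'M\xfcnchen')
```

This sketch assumes, as in the case above, that the XML carries no encoding declaration of its own. A document that declares a different encoding would conflict with the UTF-8 re-encoding.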
