python - Decoding HTML entities with Python

Question

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin".

import urllib2

from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
            "?method=librarything.ck.getwork&id=1907912"
            "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

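# Parse the XML response, converting all entities to Unicode characters.
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
# Locate the canonical title field and print its text.
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string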

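Unfortunately, instead of "Húrin" it prints "HÃºrin". This is obviously an encoding problem, but I can't figure out what I need to do to get the expected output. Any help would be greatly appreciated.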

score 4 · Accepted Answer

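In the source of the web page it looks like this: The Children of HÃºrin. So the encoding has already been broken somewhere on their side, before it even gets converted to XML...

If this is a general problem across all books and you need to work around it, this seems to work: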

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
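
For readers on Python 3, the same round trip can be sketched as a small helper. This is only an illustration, not part of the original answer: the name fix_mojibake is hypothetical, and the fallback that returns the input unchanged when the round trip fails is an added assumption, meant to avoid damaging titles that were never garbled.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the
    round trip is not possible (the text was probably fine already).
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError (characters outside Latin-1)
        # and UnicodeDecodeError (bytes that are not valid UTF-8).
        return text


print(fix_mojibake("HÃºrin"))   # garbled title from the API
print(fix_mojibake("Húrin"))    # already correct, left as-is
```

This is a heuristic: a string that legitimately contains a sequence such as "Ã©" would also be rewritten, so it is safest to apply it only to fields known to be affected.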

score 1 · Answer

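The web page may be lying about its encoding. The output looks like UTF-8. If you end up with a str, you need to decode it as UTF-8. If you have a unicode, you need to encode it to Latin-1 first.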
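
The two cases can be sketched in Python 3, where Python 2's str corresponds roughly to bytes and unicode corresponds to str. The function name to_text is hypothetical, and the sketch assumes the data really is UTF-8 that was at most mis-decoded as Latin-1.

```python
def to_text(value):
    """Return correctly decoded text for either case (hypothetical helper)."""
    if isinstance(value, bytes):
        # Raw bytes (Python 2 "str"): decode them as UTF-8.
        return value.decode("utf-8")
    # Mis-decoded text (Python 2 "unicode"): encode to Latin-1 first
    # to get the original bytes back, then decode as UTF-8.
    return value.encode("latin-1").decode("utf-8")


raw = "Húrin".encode("utf-8")      # the bytes the server actually sent
garbled = raw.decode("latin-1")    # what a wrong Latin-1 decode produces

print(garbled)           # HÃºrin
print(to_text(raw))      # Húrin
print(to_text(garbled))  # Húrin
```

Latin-1 maps every byte value to exactly one character, which is why encoding the garbled text back to Latin-1 recovers the original bytes without loss.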
