python - HTML elements in lxml are encoded incorrectly, like Най

Question

I need to print the RSS link from a web page, but the link is decoded incorrectly. Here is my code:

import urllib2
from lxml import html, etree
import chardet

data = urllib2.urlopen('http://facts-and-joy.ru/')
S=data.read()
encoding = chardet.detect(S)['encoding']
#S=S.decode(encoding)
#encoding='utf-8'

print encoding
parser = html.HTMLParser(encoding=encoding)
content = html.document_fromstring(S,parser)
loLinks = content.xpath('//link[@type="application/rss+xml"]')

for oLink in loLinks:
    print oLink.xpath('@title')[0]
    print etree.tostring(oLink,encoding='utf-8')

Here is my output:

utf-8
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; &#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed" href="http://facts-and-joy.ru/feed/" />&#13;

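The title on its own is displayed correctly, but inside tostring() it is replaced by strange &#... sequences. How can I print the whole link element correctly?

Thanks in advance for your help!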

score 2 · Accepted Answer

Here is a simplified version of your program that works:

from lxml import html

url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The key line is

print html.tostring(link, encoding="utf-8")

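This is the only thing you have to change in your original program.

Using html.tostring() instead of etree.tostring() produces actual characters rather than numeric character references. You can also use etree.tostring(link, method="html", encoding="utf-8").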
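
The two equivalent calls can be tried without fetching the site. The following is a minimal sketch, written for Python 3 (the snippets above are Python 2) and assuming lxml is installed; the PAGE string is a hypothetical stand-in for the real page, not its actual markup. The plain etree.tostring() call is included only for comparison, and what it prints may vary with the lxml/libxml2 version.

```python
from lxml import etree, html

# Hypothetical in-memory page standing in for the site from the question.
PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" '
    'href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

doc = html.document_fromstring(PAGE)
link = doc.xpath('//link[@type="application/rss+xml"]')[0]

# HTML output method with UTF-8: non-ASCII characters are written literally.
as_html = html.tostring(link, encoding="utf-8").decode("utf-8")

# The same output method requested through etree.tostring().
as_html_via_etree = etree.tostring(
    link, method="html", encoding="utf-8"
).decode("utf-8")

# Default (XML) output method, as used in the question. It may emit
# numeric character references for the title, depending on the version.
as_xml = etree.tostring(link, encoding="utf-8").decode("utf-8")

print(as_html)
print(as_html_via_etree)
print(as_xml)
```

Decoding the returned bytes with .decode("utf-8") is only needed here because Python 3 would otherwise print a bytes literal.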

It is not clear why this difference exists between the "html" and "xml" output methods. This post to the lxml mailing list did not receive any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.
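
Whatever the cause, both serializations describe the same attribute value: sequences such as &#x41F; are numeric character references to the same code points, so an HTML or XML consumer reads them back as the original text. A small check using only the standard library is sketched below; it relies on Python 3's html.unescape, which is not part of the original answer, and the variable names are illustrative.

```python
from html import unescape

# Title attribute exactly as printed by etree.tostring() in the question.
escaped = (
    "&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; "
    "&#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed"
)

# Title as printed by html.tostring() in the answer.
literal = "Позитивное мышление RSS Feed"

# Resolve the numeric character references back to characters.
decoded = unescape(escaped)
print(decoded)
print(decoded == literal)

# The trailing "&#13;" in both outputs is a reference to a carriage return.
trailing = unescape("&#13;")
print(repr(trailing))
```

The carriage return most likely comes from the whitespace following the link element in the page (its tail), which tostring() includes by default.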
