python - Encoding in Python with lxml - complex solution

Question

I need to download and parse a web page with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

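import urllib2
from lxml import etree

webfile = urllib2.urlopen(url)
# etree.parse() takes a file-like object (or a filename), not the raw page bytes
tree = etree.parse(webfile, parser=etree.HTMLParser(recover=True))

# xpath() returns a list, so take the first <body> element
body = tree.xpath('/html/body')[0]
txt = my_process_text(etree.tostring(body, encoding='utf-8'))

output = etree.Element("out")
output.text = txt

outputfile.write(etree.tostring(output, encoding='utf-8'))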

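So webfile can be in any encoding (lxml should handle this). The output file must be UTF-8. I'm not sure where to use encode/decode. Is this schema OK? (I can't find a good tutorial about lxml and encoding, but I can find many problems...) I need a robust solution.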

Edit:

So to send UTF-8 to lxml I use

        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')

score 19 · Accepted Answer

lxml can be a little wonky about input encodings. It is best to send UTF-8 in and get UTF-8 out.
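As a rough illustration of "UTF-8 in, UTF-8 out", here is a minimal sketch for Python 3, assuming lxml is installed and the page bytes have already been converted to UTF-8. The function name page_to_utf8_xml is hypothetical, and the asker's my_process_text step is left out:

```python
from lxml import etree


def page_to_utf8_xml(content):
    """Turn UTF-8 encoded HTML bytes into a UTF-8 encoded <out> XML document."""
    # State the input encoding explicitly instead of letting the parser guess
    parser = etree.HTMLParser(encoding="utf-8", recover=True)
    root = etree.fromstring(content, parser=parser)
    # string() gives the concatenated text of <body> as a Unicode string
    text = root.xpath("string(/html/body)")
    output = etree.Element("out")
    output.text = text
    # Asking for an encoding makes tostring() return bytes, ready to be
    # written to a file opened in binary mode
    return etree.tostring(output, encoding="utf-8", xml_declaration=True)


page = "<html><body><p>café</p></body></html>".encode("utf-8")
print(page_to_utf8_xml(page).decode("utf-8"))
```

Inside the program the text stays a Unicode string; bytes only appear at the two edges (parser input and tostring output).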

You might want to use the chardet module or UnicodeDammit to decode the actual data.

You'd want to do something vaguely like:

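import urllib2
import chardet
from lxml import html

content = urllib2.urlopen(url).read()
# chardet guesses the encoding from the raw bytes; the guess can be None
encoding = chardet.detect(content)['encoding']
if encoding and encoding.lower() != 'utf-8':
    # re-encode as UTF-8, replacing bytes that cannot be decoded
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)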
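The re-encoding step is easy to get subtly wrong, because a detector may report no encoding at all or spell the name differently ('UTF8', 'utf-8'). A minimal standard-library sketch for Python 3, where to_utf8 is a hypothetical helper and the fallback to UTF-8 for unknown names is my own assumption:

```python
import codecs


def to_utf8(content, encoding):
    """Re-encode bytes as UTF-8, given a detected encoding name (may be None)."""
    try:
        # codecs.lookup() normalizes spellings such as 'UTF8' and 'utf-8'
        name = codecs.lookup(encoding or "utf-8").name
    except LookupError:
        # Unknown codec name: assume UTF-8 rather than failing
        name = "utf-8"
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError
    return content.decode(name, "replace").encode("utf-8")


print(to_utf8(b"caf\xe9", "latin-1"))
```

Valid UTF-8 input passes through unchanged, so the function can be applied unconditionally.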

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

score 2 · Accepted Answer

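lxml's encoding detection is weak.

However, note that the most common problem with web pages is a missing (or incorrect) encoding declaration. It is therefore often sufficient to use only BeautifulSoup's encoding detection, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I recommend detecting the encoding with UnicodeDammit and parsing with lxml. In addition, you can use the HTTP header Content-Type (you need to extract charset=ENCODING_NAME) to detect the encoding more precisely.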
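Extracting the charset from a Content-Type value can be done with the standard library alone. A minimal sketch for Python 3; charset_from_content_type is a hypothetical helper name, and returning "" when no charset is declared is an assumption chosen to match the http_charset == "" check used in this answer:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or "" if absent."""
    if not header_value:
        return ""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None when missing
    return msg.get_content_charset() or ""


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

Using the email package avoids hand-written string splitting, which tends to break on quoted values such as charset="utf-8".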

For this example I'm using BeautifulSoup4 (you also have to install chardet for better autodetection, because UnicodeDammit uses chardet internally):

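import lxml.html
from bs4 import UnicodeDammit

# content holds the raw page bytes; http_charset is the charset taken from
# the Content-Type header, or "" if the header did not declare one
if http_charset == "":
    ud = UnicodeDammit(content, is_html=True)
else:
    ud = UnicodeDammit(content, override_encodings=[http_charset], is_html=True)
root = lxml.html.fromstring(ud.unicode_markup)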

Or, to make the previous answer more complete, you can modify it to:

if ud.original_encoding != 'utf-8':
    content = content.decode(ud.original_encoding, 'replace').encode('utf-8')

Why is this better than simply using chardet?

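1. You do not ignore the Content-Type HTTP header:

   Content-Type: text/html; charset=utf-8

2. You do not ignore the http-equiv meta tag. Example:

   ... http-equiv="Content-Type" content="text/html; charset=UTF-8" ...

3. On top of this, you are using the power of chardet, cjkcodecs and iconvcodec codecs, and many more.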
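To make that priority order concrete, here is a rough standard-library sketch for Python 3. It only illustrates the idea (HTTP header first, then the meta tag) and is not how UnicodeDammit is implemented; declared_encoding, the regular expression, and the 4096-byte scan window are all my own assumptions:

```python
import codecs
import re

# Rough pattern for charset=... inside a <meta> tag; it covers both the
# http-equiv form and the shorter <meta charset="..."> form
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_encoding(content, http_charset=""):
    """Return the first usable declared encoding, or None if none is declared."""
    candidates = [http_charset]
    match = META_CHARSET.search(content[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    for name in candidates:
        if not name:
            continue
        try:
            # Skip names Python has no codec for and normalize the rest
            return codecs.lookup(name).name
        except LookupError:
            continue
    return None


html_bytes = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
print(declared_encoding(html_bytes))
```

When this returns None, nothing was declared, and that is the point where statistical detection such as chardet has to take over.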
