python - Parsing with LXML XPath fails when using utf-16

Question

I am parsing the following page with lxml's etree: http://www.amazon.de/product-reviews/B004K1K172

The variable content holds the entire page content.

Code:

myparser = etree.HTMLParser(encoding="utf-16") #As characters are beyond utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

This returns an empty list.

But when I change the code to:

myparser = etree.HTMLParser(encoding="utf-8") #Neglecting some reviews having ascii character above utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

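Now the same XPath returns the correct data, but most of the reviews are rejected. So is this a problem with lxml's XPath, or with my XPath implementation?

How can I parse the above page using utf-16 encoding?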

score 0 · Accepted Answer

To get the character encoding automatically from the HTTP headers:

import cgi
import urllib2

from lxml import html

response = urllib2.urlopen("http://www.amazon.de/product-reviews/B004K1K172")

# extract encoding from Content-Type 
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html_text = response.read().decode(params['charset'])

root = html.fromstring(html_text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
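
The snippet above targets Python 2 (urllib2 is Python 2 only, and the cgi module was removed in Python 3.13). A minimal Python 3 sketch of the same idea, using only the standard library, might look like the following. The helper names charset_from_content_type and decode_body, the utf-8 fallback, and the sample header and body are assumptions made for illustration, not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns failobj if absent.
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type, default="utf-8"):
    """Decode raw response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type, default)
    # errors="replace" keeps going if a few bytes do not match the charset.
    return body.decode(charset, errors="replace")


# Hypothetical header value and body, standing in for a real HTTP response.
header = "text/html; charset=ISO-8859-15"
body = "Schöne Grüße".encode("iso-8859-15")
html_text = decode_body(body, header)
print(html_text)
```

With a live response from urllib.request.urlopen, response.headers.get_content_charset() should give the same result, since the headers object is a subclass of email.message.Message. The decoded string can then be passed to html.fromstring as in the answer.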

score 0 · Accepted Answer

Following nymk's suggestion:

I parsed the page using ISO-8859-15 encoding, so change the following line in the code:

myparser = etree.HTMLParser(encoding="ISO-8859-15")
However, a change had to be made in SQL to accept encodings other than utf-8.
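
As a rough illustration of why the declared encoding matters, the sketch below decodes the same bytes three ways using only Python's built-in codecs. It assumes the page bytes are ISO-8859-15; the sample markup is hypothetical, and lxml's own decoding may differ in detail from Python's. It also suggests, as an assumption rather than something the answer did, that text decoded to Unicode can be re-encoded as UTF-8 before storage, which might avoid changing the database encoding.

```python
# Hypothetical sample standing in for the downloaded page bytes.
sample_text = "<div id='productReviews'>Schöne Grüße, 5 €</div>"
raw = sample_text.encode("iso-8859-15")

# Read as UTF-16, every two bytes are fused into one code unit, so none of
# the ASCII markup (tags, attribute names) survives for XPath to match.
as_utf16 = raw.decode("utf-16", errors="replace")
markup_survives = "productReviews" in as_utf16

# Read as UTF-8, the single-byte accented characters are invalid sequences.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read with the encoding the bytes were written in, the text round-trips.
text = raw.decode("iso-8859-15")

# UTF-8 can represent every Unicode character, so the decoded text can be
# re-encoded as UTF-8 before it is stored.
utf8_bytes = text.encode("utf-8")

print(markup_survives, valid_utf8, text == sample_text)
```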
