python - Scraping a page with python and lxml - (, UnicodeEncodeError('ascii',

Question

I am using python2.7 and lxml to scrape a page. I keep getting the following error.

(<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Approximate Dimensions: 4\xbd" x 4" x 7" (assembled)', 25, 26, 'ordinal not in range(128)'), <traceback object at 0x7f9198ac48c0>)

I have tried the following:

doc = lxml.html.document_fromstring(html)
for el in doc.iter('h2'):
    el.text_content().decode('utf-8','ignore')
    OR
    el.text_content().encode('ascii', 'ignore')

How can I resolve these errors? I need to be able to 1) save to a text file and then 2) upload the text file to MySQL.

Thanks

score 2 · Accepted Answer

Try:

el.text_content().encode('utf-8')

text_content() returns unicode, and you want to store it (as text) encoded as utf-8.
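To cover the "save to a text file" step from the question, here is a minimal sketch that writes unicode text as UTF-8 bytes. It assumes the headings have already been collected as unicode strings; `save_lines` and the `headings.txt` file name are hypothetical, chosen only for illustration. `io.open` is used because it accepts an `encoding` argument on Python 2.7 as well as Python 3.

```python
import io
import os
import tempfile


def save_lines(path, lines):
    """Write unicode strings to `path` as UTF-8, one per line."""
    # io.open encodes on write, so no manual .encode('utf-8') is needed.
    with io.open(path, 'w', encoding='utf-8') as fh:
        for line in lines:
            fh.write(line + u'\n')


# The string from the traceback in the question; u'\xbd' is the 1/2 sign.
text = u'Approximate Dimensions: 4\xbd" x 4" x 7" (assembled)'
path = os.path.join(tempfile.gettempdir(), 'headings.txt')
save_lines(path, [text])
```

The resulting file holds UTF-8 bytes, so whatever loads it into MySQL should also be told that the file is UTF-8.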

score 0 · Accepted Answer

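The encoding the page declares in its headers may differ from the one it actually uses. If the page's actual encoding is not utf-8, doing the right thing gets a bit tricky.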

First, you should look at the text returned by el.text_content():

x = el.text_content()
print x

If you still see escaped sequences such as \x09, the text has not been decoded yet.

If x is unicode (its repr starts with 'u'), you should convert the unicode to str and then decode it with the correct encoding (cp1252 or something similar):

chars = ''.join([chr(ord(c)) for c in el.text_content()])  # turn the wrongly decoded unicode back into a byte str (only works for code points below 256)
result = chars.decode('cp1252')  # now decode the str properly; try different encodings until one does not throw an error
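The "try different encodings until one does not throw" step can be sketched as a small loop. This is an illustration under assumptions, not a definitive fix: `redecode` and its default candidate list are hypothetical, and it only helps when the text was wrongly decoded as latin-1 in the first place. Encoding to latin-1 has the same effect as the `chr(ord(c))` join, because latin-1 maps code points 0-255 to identical byte values.

```python
def redecode(text, encodings=('utf-8', 'cp1252')):
    """Undo a wrong latin-1 decode, then try each candidate encoding in order."""
    try:
        # Recover the original bytes from the wrongly decoded text.
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        # Code points above 255 cannot come from a latin-1 decode; leave as is.
        return text
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate matched; return the input unchanged rather than guess.
    return text


# UTF-8 bytes for 4 1/2" that were wrongly decoded as latin-1.
mojibake = u'4\xc2\xbd"'
fixed = redecode(mojibake)
```

The order of the candidates matters: strict encodings such as utf-8 should come first, because permissive single-byte encodings accept almost any input and would hide the mismatch.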
