python - Python error: 'utf8' codec can't decode byte 0x92 in position 85: invalid start byte

Question

I am using Python 2.7 and lxml. My code is as follows:

import urllib
from lxml import html

def get_value(el):
    return get_text(el, 'value') or el.text_content()

response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html').read()
dom = html.fromstring(response)

try:
    description = get_value(dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
except IndexError, e:
    description = ''

The code crashes inside the try block with this error:

UnicodeDecodeError at /
'utf8' codec can't decode byte 0x92 in position 85: invalid start byte

The string that could not be encoded/decoded is: shouldn't

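I have tried many techniques, including .encode('utf8'), but none of them solved the problem. I have two questions:

How do I solve this problem?
How can my application crash when the offending code sits between try and except?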

score 8 · Accepted Answer

The page is being served with charset=ISO-8859-1. Decode from that encoding to unicode.

[Snapshot of the details as shown by the browser. Credit: @Old Panda]
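To make the decoding step concrete, here is a minimal sketch using a made-up byte string rather than the real page, so `raw` and its contents are assumptions. It shows that byte 0x92 is rejected by the UTF-8 codec but decodes under a single-byte charset. Note that strict ISO-8859-1 maps 0x92 to a C1 control character, while Windows-1252 (which browsers typically use for pages labelled ISO-8859-1) maps it to the right single quotation mark, so `cp1252` may be the more useful choice here.

```python
# Hypothetical sample bytes standing in for part of the downloaded page.
raw = b"They shouldn\x92t"

# 0x92 cannot start a UTF-8 sequence, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Single-byte charsets accept the byte, but map it differently.
latin1_text = raw.decode("iso-8859-1")  # 0x92 -> U+0092 (control character)
cp1252_text = raw.decode("cp1252")      # 0x92 -> U+2019 (right single quote)
```

In the question's code, the decoded text (rather than the raw bytes) would then be what gets passed on for parsing.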

answered 2012-04-18T14:16:57.680

score 1 · Accepted Answer

Your except clause only handles exceptions of type IndexError. The problem is a UnicodeDecodeError, which is not an IndexError, so the exception is not handled by that except clause.

It is also not clear what 'get_value' does, and that may well be where the actual problem arises.
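The point about the except clause can be shown with a small sketch. The function names `narrow` and `wider` and the sample bytes are hypothetical, chosen only for illustration; the decode call stands in for whatever operation raises inside the try block.

```python
raw = b"shouldn\x92t"  # hypothetical bytes containing an invalid UTF-8 byte

def narrow(data):
    """Mirrors the question: only IndexError is handled."""
    try:
        return data.decode("utf-8")
    except IndexError:
        return u""

def wider(data):
    """Also handles the decoding failure."""
    try:
        return data.decode("utf-8")
    except (IndexError, UnicodeDecodeError):
        return u""

# The UnicodeDecodeError passes straight through the IndexError handler.
try:
    narrow(raw)
    escaped = False
except UnicodeDecodeError:
    escaped = True
```

Catching the error this way only hides the symptom; the text is still lost unless the bytes are decoded with the right charset.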

score 0 · Accepted Answer

- Skip the bad characters, or decode the response to unicode correctly.
- You are only catching IndexError, not UnicodeDecodeError.
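As a rough illustration of the "skip the bad characters" option, the sketch below uses the standard `errors` argument of `decode`. The sample bytes are an assumption, not data from the actual page.

```python
raw = b"shouldn\x92t"  # hypothetical bytes with one byte that is invalid in UTF-8

# "ignore" silently drops bytes that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD so the loss stays visible in the output.
replaced = raw.decode("utf-8", "replace")
```

Both handlers discard information, so they are best treated as a fallback when the real encoding cannot be determined.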

score 0 · Accepted Answer

Decode the response to unicode, handling errors properly (ignore on error), before parsing it with html.fromstring.
Catch UnicodeDecodeError, or catch all errors.
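One way to decode before parsing is sketched below, under the assumption that the charset is available from the response's Content-Type header. The helpers `charset_from_header` and `decode_body`, the header value, and the sample body are all hypothetical. Only the standard library is used, and the parsing step itself is left out.

```python
from email.message import Message

def charset_from_header(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

def decode_body(raw, content_type):
    """Decode raw bytes with the declared charset, ignoring undecodable bytes."""
    charset = charset_from_header(content_type)
    # An unknown charset name would raise LookupError; not handled in this sketch.
    return raw.decode(charset, "ignore")

# Hypothetical body and header standing in for the real HTTP response.
text = decode_body(b"<p>shouldn\x92t</p>", "text/html; charset=ISO-8859-1")
```

The resulting unicode string is what would then be handed to the parser instead of the raw bytes.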
