Update: you can reproduce this error just by running the following from the command line:
scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future
I'm using Scrapy to crawl a website. Every page I crawl claims to be UTF-8 encoded:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
But occasionally a page contains bytes that are not valid UTF-8, and I get a Scrapy error such as:
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
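I still need to scrape these pages, even if they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding and use another one instead (say, UTF-16)?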
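To make the problem concrete, here is a minimal sketch (standard library only, written for Python 3) of decoding page bytes leniently. It assumes the stray byte comes from a single-byte Western encoding: 0xe8 is "è" in Latin-1 and Windows-1252, so that is a more plausible fallback than UTF-16. The helper name `decode_lenient` and the `cp1252` default are illustrative assumptions, not part of Scrapy; in a spider callback the raw bytes would come from `response.body`.

```python
def decode_lenient(body, declared="utf-8", fallback="cp1252"):
    """Decode raw page bytes, tolerating bytes invalid in the declared encoding.

    `declared` and `fallback` are assumptions for this sketch; choose the
    fallback that matches what the site really serves.
    """
    try:
        # Happy path: the page really is what it claims to be.
        return body.decode(declared)
    except UnicodeDecodeError:
        # cp1252 leaves a handful of bytes undefined, so replace those
        # with U+FFFD instead of raising a second time.
        return body.decode(fallback, errors="replace")


# 0xe8 followed by an ASCII byte is exactly the "invalid continuation byte" case.
raw = b"tr\xe8s bien"
text = decode_lenient(raw)
print(text)
```

If losing the odd character is acceptable, `body.decode("utf-8", errors="replace")` alone would also avoid the exception, at the cost of turning each bad byte into a replacement character.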
Here is the traceback, showing where the exception is caught:
2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
    self._startRunCallbacks(result)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
    result = method(response=response, result=result, spider=spider)