Environment:
Scrapy 0.16.2 Twisted-12.2.0 python 2.7 macosx-10.6
Okey here is my problem:
I try to run
scrapy shell http://aaa.17domn.com/bt9/file.php/MERH77V.html
Error:
[ScrapyHTTPPageGetter,client] Unhandled Error
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
why = getattr(selectable, method)()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 202, in doRead
return self._dataReceived(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 208, in _dataReceived
rval = self.protocol.dataReceived(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/protocols/basic.py", line 564, in dataReceived
why = self.lineReceived(line)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 50, in lineReceived
return HTTPClient.lineReceived(self, line.rstrip())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 450, in lineReceived
self.extractHeader(self._header)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 406, in extractHeader
key, val = header.split(':',1)
exceptions.ValueError: need more than 1 value to unpack
I found the solution from https://groups.google.com/forum/#!msg/scrapy-users/xFKo8ggzPxs/VXDl3CZ4V4cJ They describe this is caused by twisted. Then I patched function extractHeader in /twisted/web/http.py from http://twistedmatrix.com/trac/ticket/2842. Its WORKS
BUT BUT, Hold on NOt yet!!!
I run another web
scrapy shell http://www1.wkdown.info/fs3/file.php/M994ATR.html
Error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 122, in _build_response
status = int(self.status)
ValueError: invalid literal for int() with base 10: 'html'
I think something happen on response headers. Scrapy cannot handle it well. Any idea? Thank you!