2

Environment:

Scrapy 0.16.2 Twisted-12.2.0 python 2.7 macosx-10.6

Okey here is my problem:

I try to run

scrapy shell http://aaa.17domn.com/bt9/file.php/MERH77V.html

Error:

[ScrapyHTTPPageGetter,client] Unhandled Error
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
        why = getattr(selectable, method)()

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 202, in doRead
        return self._dataReceived(data)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 208, in _dataReceived
        rval = self.protocol.dataReceived(data)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/protocols/basic.py", line 564, in dataReceived
        why = self.lineReceived(line)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 50, in lineReceived
        return HTTPClient.lineReceived(self, line.rstrip())

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 450, in lineReceived
        self.extractHeader(self._header)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 406, in extractHeader
        key, val = header.split(':',1)
    exceptions.ValueError: need more than 1 value to unpack

I found the solution from https://groups.google.com/forum/#!msg/scrapy-users/xFKo8ggzPxs/VXDl3CZ4V4cJ They describe this is caused by twisted. Then I patched function extractHeader in /twisted/web/http.py from http://twistedmatrix.com/trac/ticket/2842. Its WORKS

BUT BUT, Hold on NOt yet!!!

I run another web

scrapy shell http://www1.wkdown.info/fs3/file.php/M994ATR.html

Error:

Traceback (most recent call last):

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 122, in _build_response
    status = int(self.status)

ValueError: invalid literal for int() with base 10: 'html'

I think something happen on response headers. Scrapy cannot handle it well. Any idea? Thank you!

4

0 回答 0