5

我在抓取网站时遇到了一点问题。我按照scrapy的教程学习了如何抓取网站,我有兴趣在网站' https://www.leboncoin.fr '上测试它,但蜘蛛不起作用。所以,我试过:

scrapy shell 'https://www.leboncoin.fr'

但是,我没有对该网站的回应。

$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS':    'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0,   'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1039fbd30>
[s]   item       {}
[s]   request    <GET https://www.leboncoin.fr>
[s]   settings   <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

如果我使用:

view(response)

打印一个 AttributeError...

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)

/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
     67     from scrapy.http import HtmlResponse, TextResponse
     68     # XXX: this implementation is a bit dirty and could be improved
---> 69     body = response.body
     70     if isinstance(response, HtmlResponse):
     71         if b'<base' not in body:

AttributeError:“NoneType”对象没有属性“body”

编辑 1:

To rrschmidt:完整的日志已更新,当我运行时

fetch('https:www.leboncoin.fr') 

我收到这个:

2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>

那么,我该如何解决呢?

谢谢你的回答,

克里斯

4

1 回答 1

5

看起来该网站已限制通过 robots.txt 进行抓取。尊重这个愿望通常是礼貌的。

但是,如果您真的想抓取该站点,您可以通过在 settings.py 中将 ROBOTSTXT_OBEY 设置更改为 false 来更改 scrapy 的默认行为

ROBOTSTXT_OBEY=False
于 2017-05-16T06:53:49.160 回答