I've been trying to follow the Scrapy tutorial (as in, the very beginning) and after running the command at the project top level (i.e. the level with scrapy.cfg) I get the following output:

 mikey@ubuntu:~/scrapy/tutorial$ scrapy crawl dmoz
/usr/lib/pymodules/python2.7/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-26 04:17:06-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: tutorial)
2014-01-26 04:17:06-0800 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-01-26 04:17:06-0800 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'tutorial.items.TutorialItem', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled item pipelines: 
2014-01-26 04:17:06-0800 [dmoz] INFO: Spider opened
2014-01-26 04:17:06-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-26 04:17:06-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] INFO: Closing spider (finished)
2014-01-26 04:17:07-0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 472,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 14888,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 1, 26, 12, 17, 7, 63261),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 1, 26, 12, 17, 6, 567929)}
2014-01-26 04:17:07-0800 [dmoz] INFO: Spider closed (finished)
mikey@ubuntu:~/scrapy/tutorial$ 

(I.e. 0 pages crawled, at 0 pages a minute!)

Troubleshooting so far:

1) Checked the syntax of both items.py and dmoz_spider.py (both copied-and-pasted and hand-typed).
2) Checked for the problem online, but cannot see others with a similar issue.
3) Checked the folder structure etc., making sure I was running the command from the correct place.
4) Upgraded to the latest version of Scrapy.

Any suggestions? My code is precisely as in the examples.

dmoz_spider.py is......

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Write each response body to a local file named after the
        # last non-empty segment of the URL path.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

and items.py......

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

3 Answers

First, you should figure out what you actually want to scrape.

You passed two start URLs to Scrapy, so it crawled them, but it could not find any more URLs to follow.

None of the book links on those pages match the allowed_domains restriction of dmoz.org.

You can do yield Request([next url]) to crawl more links, where the next URL can be parsed from the response; see the sketch below.
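
For example, a minimal sketch against the Scrapy 0.22-era API used in the question (the catch-all //a/@href XPath is illustrative; you would normally narrow it to the links you care about):

from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        sel = Selector(response)
        # Follow every link found on the page; requests that leave
        # dmoz.org are silently dropped by the OffsiteMiddleware
        # because of allowed_domains.
        for href in sel.xpath('//a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse)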

Or subclass CrawlSpider and specify the rules, as in the sketch below.
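
A minimal CrawlSpider sketch along those lines (again Scrapy 0.22-era modules; the single catch-all rule is an assumption, and in practice you would tighten it with allow= patterns):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DmozCrawlSpider(CrawlSpider):
    name = "dmoz_crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    # Follow every in-domain link and hand each fetched page to
    # parse_page (CrawlSpider reserves parse() for its own use, so
    # the callback must have a different name).
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)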

Answered 2014-01-26T18:13:26.783
That "Crawled 0 pages (at 0 pages/min)" line is printed periodically, the first time being when the spider opens. There is nothing wrong with your code; you simply haven't implemented anything else yet.

Answered 2014-01-26T18:17:15.470
You have to yield an item for it to be saved, and then you have to yield Request(<next_url>) to move on to a new page; see the sketch below.
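
A minimal sketch of both steps together, reusing the DmozItem from the question (written against the more recent Scrapy API current as of this answer; the XPaths are placeholders, not tested selectors):

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        for li in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = li.xpath('a/text()').extract_first()
            item['link'] = li.xpath('a/@href').extract_first()
            item['desc'] = li.xpath('text()').extract_first()
            yield item  # yield each item so pipelines/feed exports see it
        # then yield a Request to move on to the next page, if any
        next_url = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)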

You can check out this blog post to learn how to get started with Scrapy:

https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/

Answered 2019-09-25T21:19:19.410