-1

我一直在尝试学习 Scrapy 教程,在项目顶层运行命令后,我得到以下输出:

2016-07-05 21:06:01 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-05 21:06:01 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-07-05 21:06:01 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:02 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:02 [scrapy] INFO: Spider opened
2016-07-05 21:06:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-05 21:06:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:02 [scrapy] INFO: Closing spider (finished)
2016-07-05 21:06:02 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000)}
2016-07-05 21:06:02 [scrapy] INFO: Spider closed (finished)

dmoz.py 是...

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import TutorialItem

class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    strat_urls = ('http://www.dmoz.org/Computers/Programming/Languages/Python/Books/')

    def parse(self,response):
        lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')

        for li in lislink:
            item  = TutorialItem()
            item['link'] = li.xpath('@href').extract()
            yield item

items.py 是...

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    pass

但是,在 shell 中调试项目时,我可以获得 url。

D:\pythonweb\scrapy\test2>scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
2016-07-05 21:06:40 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-07-05 21:06:40 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-07-05 21:06:40 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:40 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:40 [scrapy] INFO: Spider opened
2016-07-05 21:06:42 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x03BF0E30>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x03BF05F0>
[s]   spider     <DefaultSpider 'default' at 0x432b1d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')
>>> lislink.xpath('@href').extract()
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html', u'http://www.brpreiss.com/books/opus7/html/book.html', u'http://www.diveintopython.net/', u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/', u'http://www.techbooksforfree.com/perlpython.shtml', u'http://www.freetechbooks.com/python-f6.html', u'http://greenteapress.com/thinkpython/', u'http://www.network-theory.co.uk/python/intro/', u'http://www.freenetpages.co.uk/hp/alan.gauld/', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html', u'http://hetland.org/writing/practical-python/', u'http://sysadminpy.com/', u'http://www.qtrac.eu/py3book.html', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html', u'https://www.packtpub.com/python-3-object-oriented-programming/book', u'http://www.network-theory.co.uk/python/language/', u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1', u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0672317354', u'http://gnosis.cx/TPiP/', u'http://www.informit.com/store/product.aspx?isbn=0130211192']
>>>

这是我的平台。

Scrapy    : 1.1.0
lxml      : 3.6.0.0
libxml2   : 2.9.0
Twisted   : 16.2.0
Python    : 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)]
pyOpenSSL : 16.0.0 (OpenSSL 1.0.2h  3 May 2016)
Platform  : Windows-10-10.0.10586

4

1 回答 1

1

它不是strat_urls,它是start_urls,它需要是一个可迭代的(通常是一个列表),其中 url 作为项目:

class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']
于 2016-07-05T13:41:18.793 回答