
Edit 2: I accidentally posted the wrong code because my folders got mixed up with the names I chose. See below for the exact code of each file, taken from the correct folder containing all my files.

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for pics project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pics'

SPIDER_MODULES = ['pics.spiders']
NEWSPIDER_MODULE = 'pics.spiders'

IMAGES_STORE = 'W:/scrapy/scraped/'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pics (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'pics.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'pics.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'pics.pipelines.ImagesPipeline': 300,
}



# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py

import scrapy

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from PIL import Image


class PicsPipeline(object):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(url=img_url, meta=meta)


    def process_item(self, item, spider):
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item

class PicsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #  pass

    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()

blogspot.py

import scrapy

from scrapy.selector import Selector, HtmlXPathSelector
from pics.items import PicsItem
from PIL import Image


class BlogspotSpider(scrapy.Spider):
    name = "blogspot"
    allowed_domains = ['blogspot.fr']
    start_urls = ["http://10rambo.blogspot.fr/"]

    def parse(self, response):

        LOG_FILE = "spider.log"

        for sel in response.xpath('/html'):
            item = PicsItem()

        for elem in response.xpath("//div[contains(@class, 'hentry')]"):
                item['image_name'] = elem.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
                url = elem.xpath("//div[contains(@class, 'entry-content')]/a/@href").extract_first()
                item['image_urls'] = url

                yield scrapy.Request(response.urljoin(url), callback=self.parse, dont_filter=True)

This is what the current log says:

2017-06-15 21:20:00 [scrapy] INFO: Scrapy 1.2.1 started (bot: pics)
2017-06-15 21:20:00 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pics.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pics.spiders'], 'LOG_FILE': 'log.log', 'BOT_NAME': 'pics'}
2017-06-15 21:20:00 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 21:20:00 [scrapy] INFO: Enabled item pipelines:
['pics.pipelines.ImagesPipeline']
2017-06-15 21:20:00 [scrapy] INFO: Spider opened
2017-06-15 21:20:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-15 21:20:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-15 21:20:00 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (404) <GET https://2.bp.blogspot.com/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://12manrambotapes.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
2017-06-15 21:20:02 [scrapy] INFO: Closing spider (finished)
2017-06-15 21:20:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3247,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 2120514,
 'downloader/response_count': 10,
 'downloader/response_status_count/200': 9,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 15, 19, 20, 2, 80000),
 'log_count/DEBUG': 11,
 'log_count/ERROR': 7,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 10,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'spider_exceptions/AttributeError': 7,
 'start_time': datetime.datetime(2017, 6, 15, 19, 20, 0, 349000)}
2017-06-15 21:20:02 [scrapy] INFO: Spider closed (finished)

2 Answers


Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg>

It is not HTML; it is a JPG file, so it has no xpath.
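The spider schedules its own parse callback for every URL it finds, including the image links, so parse then receives a binary response it cannot select into. One way to guard against this can be sketched without Scrapy at all (a minimal heuristic using only the standard library's mimetypes module; the function name is illustrative, not Scrapy API):

```python
import mimetypes

def looks_like_html(url):
    """Guess the MIME type from the URL's extension and reject
    anything that maps to an image type. URLs without a recognizable
    extension are assumed to lead to an HTML page."""
    guessed, _ = mimetypes.guess_type(url)
    return not (guessed or "").startswith("image/")

# A page URL passes the filter, an image URL does not.
print(looks_like_html("http://10rambo.blogspot.fr/"))
print(looks_like_html("https://2.bp.blogspot.com/pic.jpg"))
```

In the spider, image URLs would go into item['image_urls'] for the images pipeline to download, and only URLs passing such a filter would be yielded as a Request with callback=self.parse.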

Answered 2017-06-16T06:36:38.563

This part of the log tells you what went wrong:

File "W:\path\to\mypics\spiders\blogspot.py", line 29, in parse
  yield Request(response.urljoin(url), callback=self.parse)
NameError: global name 'Request' is not defined

It looks like you forgot to qualify the Request class with the scrapy module. However, that traceback does not correspond to the code you posted: you have no blogspot.py but a spider.py, and inside the parse method you do qualify the Request class with the scrapy module correctly. On the other hand, you are missing it in pipelines.py, in the get_media_requests method:

yield Request(url=img_url, meta=meta)
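There is a second problem on that same line: the loop binds image_url, but the yield uses an undefined img_url (and an undefined meta). The corrected loop can be sketched in isolation; here a namedtuple stands in for scrapy.Request so the fix can be run without Scrapy installed:

```python
from collections import namedtuple

# Stand-in for scrapy.Request; the real pipeline would yield
# scrapy.Request objects built the same way.
Request = namedtuple("Request", ["url"])

def get_media_requests(item):
    # The loop binds image_url, so that exact name must be used
    # when building the request.
    for image_url in item["image_urls"]:
        yield Request(url=image_url)

requests = list(get_media_requests({"image_urls": ["https://example.com/a.jpg"]}))
```

In the real pipeline the same renaming applies: yield scrapy.Request(url=image_url) inside the loop, after importing scrapy at the top of pipelines.py.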
Answered 2017-06-15T05:48:26.817