I am using Scrapy to download images from the website https://pixabay.com/. My working code is below:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from website.imageItems import imageItems

class imageSpider(Spider):
    name = "imageCrawler"
    start_urls = ['https://pixabay.com/en/goose-bird-isolated-feather-1988657/']

    def parse(self, response):
        img = imageItems()
        image_urls = response.xpath('//div[@id="media_container"]/img/@src').extract_first()
        yield imageItems(image_urls = [image_urls])
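With this code I can download the image https://cdn.pixabay.com/photo/2017/01/18/01/07/goose-1988657_960_720.png without any problem. However, if I modify the code to download a larger version of the same image, it stops working: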
    def parse(self, response):
        img = imageItems()
        image_urls = 'https://pixabay.com/en/photos/download/' + response.xpath('//tr[@class="no_default"]/td/input/@value').extract_first()
        yield imageItems(image_urls = [image_urls])
In this last version, the image URL is:
https://pixabay.com/en/photos/download/goose-bird-isolated-feather-1988657.png
However, the server turns that URL into a processed (hashed) URL: https://pixabay.com/get/e83cb9072ef1063ecd1f4107ee4d4697e16ae3d111b4134392f3c27e/goose-1988657.png
Because of the hashed URL, my Scrapy code does not work. The error:
2017-01-24 08:25:22 [scrapy] DEBUG: Crawled (200) <GET https://pixabay.com/en/photos/download/goose-1988657.png> (referer: None)
2017-01-24 08:25:22 [scrapy] DEBUG: File (downloaded): Downloaded file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BmpImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BufrStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing CurImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DdsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing EpsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FitsStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FliImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FpxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FtexImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GbrImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GifImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GribStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Hdf5StubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcnsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImtImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IptcImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing JpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Jpeg2KImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing McIdasImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MicImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MspImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PalmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PdfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PixarImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PngImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PsdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SgiImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SpiderImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SunImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TgaImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TiffImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WebPImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WmfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XbmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XVThumbImagePlugin
2017-01-24 08:25:22 [scrapy] ERROR: File (unknown-error): Error processing file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
Traceback (most recent call last):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1185, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1162, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://pixabay.com/en/photos/download/goose-1988657.png>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\files.py", line 339, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 64, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 68, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 81, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\PIL\Image.py", line 2349, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001D25008A888>
2017-01-24 08:25:22 [scrapy] WARNING: Dropped: Item contains no images
{'image_urls': ['https://pixabay.com/en/photos/download/goose-1988657.png']}
2017-01-24 08:25:22 [scrapy] INFO: Closing spider (finished)
2017-01-24 08:25:22 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 574,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9486,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'file_count': 1,
'file_status_count/downloaded': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 24, 2, 55, 22, 780851),
'item_dropped_count': 1,
'item_dropped_reasons_count/DropItem': 1,
'log_count/DEBUG': 48,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 2,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 1, 24, 2, 55, 20, 983794)}
2017-01-24 08:25:22 [scrapy] INFO: Spider closed (finished)
This problem is not specific to one site. Whenever a server generates a dynamic URL for an image, Scrapy fails. Has anyone run into the same kind of problem?
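One thing that might help narrow this down: the log shows a 200 response for the download URL and only 9486 bytes for both responses combined, so the body handed to PIL may be an HTML page rather than the PNG. That is a guess from the numbers, not something the log proves. Below is a minimal sketch of a check on the first bytes of a response body. The names `IMAGE_SIGNATURES`, `sniff_image_type` and `describe_body` are hypothetical helpers, not part of Scrapy or PIL, and only PNG, JPEG and GIF signatures are covered.

```python
# Hypothetical helpers (not part of Scrapy or PIL): guess the payload type
# from its leading "magic" bytes so a non-image download is easy to spot.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(body):
    """Return 'png', 'jpeg' or 'gif' if body starts with a known signature, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if body.startswith(signature):
            return kind
    return None


def describe_body(body, content_type=""):
    """Summarise a response body as a short string suitable for logging."""
    kind = sniff_image_type(body)
    if kind is not None:
        return "image/%s, %d bytes" % (kind, len(body))
    # Show the start of the body: HTML usually begins with '<!DOCTYPE' or '<html'.
    return "not a recognised image (Content-Type: %r, %d bytes, starts with %r)" % (
        content_type, len(body), body[:40])
```

To try it, one could request the download URL from a normal spider callback and log `describe_body(response.body, response.headers.get('Content-Type'))`. If that reports HTML, the failure would be about what the server returns for that URL rather than about the image pipeline itself.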