Our scraper currently downloads not only text but also images. The scraper itself works fine, but we have big problems with the quality of the downloaded images. After checking the standard ImagesPipeline, we implemented a custom one that tells Pillow to use the highest quality. It looks like this (and is configured in settings.py):

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from cStringIO import StringIO

class CustomImagesPipeline(ImagesPipeline):

    def convert_image(self, image, size=None):
        buf = StringIO()
        image.save(buf, 'JPEG', quality=100)
        return image, buf

We also tried several other presets taken from this file: https://github.com/python-imaging/Pillow/blob/master/PIL/JpegPresets.py

However, we did not see any improvement. Has anyone here tackled this problem before, or does anyone have an idea what's wrong with the code?

Thanks :)

1 Answer

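I solved this particular problem with a different approach, made possible by a recent pull request that is not yet documented.

The pull request introduces a new pipeline called FilesPipeline: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/pipeline/files.py

I had to make the following changes to get it working: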

  • Rename the image_urls field, which was used for the images pipeline, to file_urls in items.py
  • Activate the pipeline in settings.py and define a storage location
    • ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}
    • FILES_STORE = '/Users/chris/Scrapy/project/images'
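
Put together, the two changes above might look roughly like the sketch below. The item class name PageItem and its text field are hypothetical and only there for illustration. The files field is an assumption based on how FilesPipeline stores its download results, and the module path is the one from the pull request (newer Scrapy releases expose the same class as scrapy.pipelines.files.FilesPipeline).

```python
# items.py
from scrapy.item import Item, Field


class PageItem(Item):    # hypothetical item name, not from the original project
    text = Field()       # hypothetical field for the scraped text
    file_urls = Field()  # formerly image_urls; FilesPipeline reads the URLs from here
    files = Field()      # assumption: download results are stored here when declared


# settings.py
# Module path as given in the pull request; adjust it for your Scrapy version.
ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}
FILES_STORE = '/Users/chris/Scrapy/project/images'
```

The spider then only has to fill file_urls with the image URLs it finds, the same way it filled image_urls before.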

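Apart from these changes, the pipeline works exactly like the images pipeline. Obviously, this approach only works if all you need are the files from the website in their original format.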

Answered 2013-10-16T07:55:21.443