
I have a set of links that define the structure of a website. When downloading images from those links, I want to place the downloaded images into a folder structure that mirrors the site structure, rather than just renaming them (as answered in "Scrapy image download how to use custom filename").

My code looks like this:

import os
from urlparse import urlparse  # Python 2, as in the original post

from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    """Custom image pipeline to rename images as they are being downloaded"""
    page_url=None
    def image_key(self, url):
        page_url=self.page_url
        image_guid = url.split('/')[-1]
        return '%s/%s/%s' % (page_url,image_guid.split('_')[0],image_guid)

    def get_media_requests(self, item, info):
        #http://store.abc.com/b/n/s/m
        os.system('mkdir '+item['sku'][0].encode('ascii','ignore'))
        self.page_url = urlparse(item['start_url']).path #I store the parent page's url in start_url Field
        for image_url in item['image_urls']:
            yield Request(image_url)

It creates the desired folder structure, but when I drill down into the folders I find that the files have been placed in the wrong ones.

I suspect this happens because get_media_requests and image_key may execute asynchronously, so the value of page_url changes before image_key gets to use it.
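That suspicion can be demonstrated without Scrapy at all: if scheduling (get_media_requests) for several items finishes before any download callback (image_key) runs, the shared attribute only ever holds the last item's value. A minimal plain-Python sketch (the item fields and URLs are made up for illustration):

```python
# Sketch of the race: all get_media_requests calls happen first,
# then image_key reads whatever page_url was written *last*.
class Pipeline:
    page_url = None

    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return '%s/%s/%s' % (self.page_url, image_guid.split('_')[0], image_guid)

    def get_media_requests(self, item):
        self.page_url = item['page_url']
        return list(item['image_urls'])

pipe = Pipeline()
# "Scheduling" phase: both items are processed up front...
urls_a = pipe.get_media_requests({'page_url': '/shirts', 'image_urls': ['http://x/a_1.jpg']})
urls_b = pipe.get_media_requests({'page_url': '/shoes', 'image_urls': ['http://x/b_1.jpg']})
# ..."download" phase: image_key now sees '/shoes' for BOTH items.
print(pipe.image_key(urls_a[0]))  # /shoes/a/a_1.jpg  -- wrong folder!
print(pipe.image_key(urls_b[0]))  # /shoes/b/b_1.jpg
```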


2 Answers


You are absolutely right: asynchronous item processing prevents the use of class variables in the pipeline. You have to store the path in each request instead and override a few more methods (untested):

def image_key(self, url, page_url):
    image_guid = url.split('/')[-1]
    return '%s/%s/%s' % (page_url, image_guid.split('_')[0], image_guid)

def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield Request(image_url, meta=dict(page_url=urlparse(item['start_url']).path))

def get_images(self, response, request, info):
    key = self.image_key(request.url, request.meta.get('page_url'))
    ...

def media_to_download(self, request, info):
    ...
    key = self.image_key(request.url, request.meta.get('page_url'))
    ...

def media_downloaded(self, response, request, info):
    ...
    try:
        key = self.image_key(request.url, request.meta.get('page_url'))
    ...
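For reference, in newer Scrapy versions (1.0 and later) the image_key/get_images/media_downloaded trio was replaced by a single file_path() hook that receives the Request, so the meta-based approach above collapses into one override. A sketch assuming that API; the path logic itself is plain Python and shown as a standalone function (the class and function names here are illustrative, not from the original post):

```python
def folder_tree_path(image_url, page_url):
    """Return '<page path>/<filename prefix>/<filename>' for an image URL."""
    image_guid = image_url.split('/')[-1]
    return '%s/%s/%s' % (page_url, image_guid.split('_')[0], image_guid)

# Inside the pipeline (requires Scrapy; shown un-executed):
#
# from urllib.parse import urlparse
# from scrapy import Request
# from scrapy.pipelines.images import ImagesPipeline
#
# class FolderTreeImagesPipeline(ImagesPipeline):
#     def get_media_requests(self, item, info):
#         page_url = urlparse(item['start_url']).path
#         for image_url in item['image_urls']:
#             yield Request(image_url, meta={'page_url': page_url})
#
#     def file_path(self, request, response=None, info=None, *, item=None):
#         return folder_tree_path(request.url, request.meta['page_url'])

print(folder_tree_path('http://store.abc.com/img/sku123_front.jpg', '/b/n/s/m'))
# /b/n/s/m/sku123/sku123_front.jpg
```

Because file_path receives the Request directly, no shared state is involved and the images land under their own page's folder.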
Answered 2012-10-20T22:26:47.360

This Scrapy pipeline extension provides a simple way to store downloaded files in a folder tree.

You have to install it first:

pip install scrapy_folder_tree

Then add the pipeline to your settings:

ITEM_PIPELINES = {
    'scrapy_folder_tree.ImagesHashTreePipeline': 300
}

Disclaimer: I am the author of scrapy-folder-tree.

Answered 2022-02-06T20:05:52.830