To capture every redirect path, including the cases where the final URL has already been crawled, I wrote a custom duplicate filter:

import logging

from scrapy.dupefilters import RFPDupeFilter
from seoscraper.items import RedirectionItem

class CustomURLFilter(RFPDupeFilter):

    def __init__(self, path=None, debug=False):
        super(CustomURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)

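        if request_seen:
            # Duplicate request: record the redirect chain that led to this URL.
            # Note: the item is only built here; it is not yet sent to any pipeline.
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url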

        return request_seen

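Now, how can I send the RedirectionItem directly to the pipeline? Is there a way to instantiate the pipeline from the custom filter so that I can send the data to it directly? Or should I also write a custom scheduler and get the pipeline from there, and if so, how?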
