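I used scrapy-redis to set up a simple distributed crawler. The slave nodes need to read URLs from the master's URL queue, but what a slave actually receives is cPickle-serialized data rather than a plain URL. I want the URLs read from the Redis URL queue to be correct. Do you have any suggestions?
Example: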
from scrapy_redis.spiders import RedisSpider
from scrapy.spider import Spider
from example.items import ExampleLoader


class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'redisspider'
    redis_key = 'wzws:requests'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        el = ExampleLoader(response=response)
        el.add_xpath('name', '//title[1]/text()')
        el.add_value('url', response.url)
        return el.load_item()
MySpider inherits from RedisSpider. When I run scrapy runspider myspider_redis.py, it reports invalid URLs.
scrapy-redis GitHub repository: scrapy-redis
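
To illustrate what I think is happening, here is a minimal sketch that needs no Redis server. It assumes (I have not confirmed this against the scrapy-redis source for my version) that entries under a key like 'wzws:requests' are pickled dicts describing a request, with a 'url' field, while a start-URLs key such as 'myspider:start_urls' would hold plain URL strings. The helper names looks_like_url and url_from_entry are hypothetical and exist only for this illustration.

```python
import pickle


def looks_like_url(raw: bytes) -> bool:
    """Return True if the raw queue value is a plain http(s) URL string."""
    return raw.startswith((b"http://", b"https://"))


def url_from_entry(raw: bytes) -> str:
    """Recover a URL from either a plain string or a pickled request dict."""
    if looks_like_url(raw):
        return raw.decode("utf-8")
    # Assumption: the entry is a pickled dict with a 'url' key.
    # Only unpickle data that comes from a Redis instance you trust.
    data = pickle.loads(raw)
    return data["url"]


# Simulated queue entries standing in for values read from Redis.
plain_entry = b"http://example.com/page"
pickled_entry = pickle.dumps(
    {"url": "http://example.com/page", "method": "GET"}, protocol=2
)

print(looks_like_url(plain_entry))    # True
print(looks_like_url(pickled_entry))  # False
print(url_from_entry(pickled_entry))  # http://example.com/page
```

If that assumption holds, a spider whose redis_key points at the serialized request queue would receive bytes like pickled_entry and treat them as a URL, which might explain the invalid URL error.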