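I have an item that gets filled in across several parse functions. I want to return the updated item once all parsing is complete. Here is my scenario:
My item class: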
    from scrapy.item import Item, Field

    class MyItem(Item):
        name = Field()
        links1 = Field()
        links2 = Field()
After logging in, I have multiple URLs to scrape.
In the parse function, I do:
    for url in urls:
        yield Request(url=url, callback=self.get_info)
In get_info, I extract the "name" and the "links" from each response:
    item = MyItem()
    item['name'] = hxs.select("//title/text()").extract()
    links = []
    for data in json_parsed_from_response:
        # Build a fresh dict per entry; reusing one dict would make every
        # element of links point at the same object.
        link = {}
        link['name'] = data.get('name')
        link['url'] = data.get('url')
        links.append(link)
    item['links1'] = links
    # Similarly, item['links2'] is created.
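A side note on the loop above: each entry needs its own dict, because appending one shared dict repeatedly stores the same object several times. A minimal standalone sketch of the difference follows. The sample data and the two helper functions are made up for illustration and are not part of my spider.

```python
# Hypothetical sample data standing in for the JSON parsed from the response.
sample = [
    {"name": "a", "url": "http://example.com/a"},
    {"name": "b", "url": "http://example.com/b"},
]


def build_links_shared(rows):
    """Reuse one dict for every row, so all entries alias the same object."""
    links = []
    link = {}
    for data in rows:
        link["name"] = data.get("name")
        link["url"] = data.get("url")
        links.append(link)
    return links


def build_links_fresh(rows):
    """Create a new dict per row, so each entry keeps its own values."""
    links = []
    for data in rows:
        links.append({"name": data.get("name"), "url": data.get("url")})
    return links


shared = build_links_shared(sample)
fresh = build_links_fresh(sample)
print([l["name"] for l in shared])  # ['b', 'b']
print([l["name"] for l in fresh])   # ['a', 'b']
```

With the shared dict, every request built from the list would end up pointing at the last URL only.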
Now I want to request every URL in item['links1'] and item['links2'] (these loops are inside get_info):
    for link in item['links1']:
        request = Request(url=link['url'], callback=self.get_status)
        request.meta['link'] = link
        yield request
    for link in item['links2']:
        request = Request(url=link['url'], callback=self.get_status)
        request.meta['link'] = link
        yield request
    # Where do I return item? I can't return it from inside a generator.

    def get_status(self, response):
        link = response.meta['link']
        if "good" in response.body:
            link['status'] = 'good'
        else:
            link['status'] = 'bad'
        # Will changes made here be reflected in item?
        # Also, I can't return item from here: multiple items would be returned.
I don't know where to return the item from so that it contains all the updated data.
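To make the goal concrete, here is a minimal sketch of the behaviour I am after, written in plain Python without Scrapy so it can be run on its own. It assumes one possible approach: a shared state dict with a pending counter travels with every request, and the item is handed back only when the last response has been processed. The dict-based "requests", the `state` dict, the `bodies` mapping, and the simplified function signatures are all hypothetical stand-ins, not Scrapy's API.

```python
def get_info(item):
    """Emit one fake 'request' per link, each carrying the shared state."""
    all_links = item["links1"] + item["links2"]
    # Shared state plays the role of values placed in request.meta.
    state = {"item": item, "pending": len(all_links)}
    for link in all_links:
        yield {"url": link["url"], "meta": {"link": link, "state": state}}


def get_status(request, body):
    """Update the link in place; return the item only after the last response."""
    link = request["meta"]["link"]
    state = request["meta"]["state"]
    link["status"] = "good" if "good" in body else "bad"
    state["pending"] -= 1
    if state["pending"] == 0:
        return state["item"]
    return None


item = {
    "name": "demo",
    "links1": [{"name": "a", "url": "http://example.com/a"}],
    "links2": [{"name": "b", "url": "http://example.com/b"}],
}
# Hypothetical response bodies keyed by URL.
bodies = {"http://example.com/a": "all good", "http://example.com/b": "error"}

results = [get_status(req, bodies[req["url"]]) for req in get_info(item)]
finished = [r for r in results if r is not None]
print(len(finished), item["links1"][0]["status"], item["links2"][0]["status"])
```

In this sketch the link dicts are the same objects that sit inside the item, so the status written in get_status shows up in the item. In a real spider I assume the counter would also have to be decremented for requests that fail or get dropped by the duplicate filter (for example through the errback and dont_filter arguments of Request), otherwise it would never reach zero.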