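I have an item that gets filled in across several parse functions. I want to return the updated item once all parsing is complete. Here is my scenario: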

My Item class:

from scrapy.item import Item, Field

class MyItem(Item):
    name = Field()
    links1 = Field()
    links2 = Field()

After logging in, I have multiple URLs to scrape.

In the parse function, I am doing:

for url in urls:
    yield Request(url=url, callback=self.get_info)

In get_info, I extract the "name" and the "links" from each response:

item = MyItem()
item['name'] = hxs.select("//title/text()").extract()
links = []
for data in json_parsed_from_response:
    # Create a new dict per entry; reusing one dict would make every
    # list element point to the same (last) link.
    link = {'name': data.get('name'), 'url': data.get('url')}
    links.append(link)
item['links1'] = links

#similarly, item['links2'] is created.

Now I want to visit every url in item['links1'] and item['links2'] (these loops are inside get_info):

for link in item['links1']:
    request = Request(url= link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

for link in item['links2']:
    request = Request(url= link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

# Where do I return the item? I can't return it from inside a generator.

def get_status(self, response):

    link = response.meta['link']
    if "good" in response.body:
        link['status'] = 'good'
    else:
        link['status'] = 'bad'

    # Changes made here, will be reflected in item? 
    # Also, I can't return item from here. Multiple items will be returned.

I don't know where to return the item from; it should contain all the updated data.

1 Answer

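Sorry, but unless you provide more details I can't follow your code design, so I can't help much... My best suggestion is to create a list of MyItem objects and append every item you create to that list. The list holds references to the same objects, so when you change them later the values in the list should change too. You should then be able to iterate over the list and see the updated items.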
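A minimal sketch of that idea, assuming plain dicts in place of MyItem so it runs without Scrapy. The names collected_items, make_item and mark_status are made up for illustration and are not part of the question's spider. It only shows that updating a link dict through another reference is visible in the item stored in the list:

```python
# Plain dicts stand in for MyItem here (an assumption, to keep the sketch
# runnable without Scrapy); fields are read and written the same way.
collected_items = []  # hypothetical list holding every item that gets created


def make_item(name, raw_links):
    """Build one item; each link gets its own dict so updates stay separate."""
    links = [{"name": d.get("name"), "url": d.get("url")} for d in raw_links]
    item = {"name": name, "links1": links}
    collected_items.append(item)
    return item


def mark_status(link, body):
    """Mimic get_status: update the link dict in place."""
    link["status"] = "good" if "good" in body else "bad"


item = make_item("example", [{"name": "a", "url": "http://example.com/a"},
                             {"name": "b", "url": "http://example.com/b"}])

# These calls play the role of the get_status callbacks.
mark_status(item["links1"][0], "all good")
mark_status(item["links1"][1], "error")

# The list sees the updates because it stores the same objects.
for it in collected_items:
    print(it["name"], [link["status"] for link in it["links1"]])
```

In a real spider the callbacks run asynchronously, so you would still need some way to know when the last get_status call for an item has finished before reading the list.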

Answered 2013-09-25T09:46:41.540