python - How do I use ItemLoaders to add data to dict-like item fields?

Question

I'm using Scrapy's XPathItemLoader, but the API docs only cover adding values to item fields and go no deeper :( What I mean is:

def parse_item(self, response):
    loader = XPathItemLoader(response=response)
    loader.add_xpath('name', '//h1')

This adds the values found by the XPath to Item.name, but how do I add them to Item.profile['name']?

score 2 · Accepted Answer

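XPathItemLoader.add_xpath does not support writing to nested fields. You should build the profile dict manually and write it with the add_value method (in case you still need to use the loader). Alternatively, you can write your own custom loader.

Here is an example that uses add_value: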

from scrapy.contrib.loader import XPathItemLoader
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TestItem(Item):
    others = Field()


class WikiSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=TestItem(), response=response)

        others = {}
        crawled_items = hxs.select('//div[@id="mp-other"]/ul/li/b/a')
        for item in crawled_items:
            href = item.select('@href').extract()[0]
            name = item.select('text()').extract()[0]
            others[name] = href

        loader.add_value('others', others)
        return loader.load_item()

Run it with: scrapy runspider <script_name> --output test.json

The spider collects the entries under "Other areas of Wikipedia" on the Wikipedia main page and writes them to the dict field others.

Hope that helps.

score 0 · Accepted Answer

These are the defaults of scrapy.loader.ItemLoader:

class ItemLoader(object):

    default_item_class = Item
    default_input_processor = Identity()
    default_output_processor = Identity()
    default_selector_class = Selector

When you use add_value, add_xpath, or add_css, the default input and output processors are Identity(), which returns the values unchanged. So you can simply use add_value:

name = response.xpath('//h1/text()').extract_first()
loader.add_value('profile', {'name':name})
