python - Not sure what to iterate over with Scrapy

Question

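I'm having trouble iterating over a crawl with Scrapy. I'm extracting a title field and a content field. The problem is that I get a JSON file that lists all the titles and then all the contents. I'd like to get {title}, {content}, {title}, {content}, which means I probably have to loop inside the parse function. What I can't figure out is which element to loop over (i.e., for x in [???]). Here is the code: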

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import SitemapSpider

from Foo.items import FooItem


class FooSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ['http://www.foo.com/sitemap.xml']
    #sitemap_rules = [


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FooItem()
        item['title'] = hxs.select('//span[@class="headline"]/text()').extract()
        item['content'] = hxs.select('//div[@class="articletext"]/text()').extract()
        items.append(item)
        return items

score 2 · Accepted Answer

Your XPath queries return all the titles and all the contents on the page. I suppose you could do this:

titles = hxs.select('//span[@class="headline"]/text()').extract()
contents = hxs.select('//div[@class="articletext"]/text()').extract()

for title, content in zip(titles, contents):
    item = FooItem()
    item['title'] = title
    item['content'] = content
    yield item

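But this isn't reliable. Instead, try an XPath query that returns each block containing both the title and the content. If you show me the XML source, I can help you.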

blocks = hxs.select('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.select('span[@class="headline"]/text()').extract()
    item['content'] = block.select('div[@class="articletext"]/text()').extract()
    yield item

I'm not sure about the exact XPath queries, but I think the idea is clear.
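To illustrate why pairing two page-wide lists by position is fragile while per-block queries hold up, here is a small sketch. It uses the standard library's xml.etree.ElementTree in place of Scrapy selectors so that it runs without Scrapy installed. The markup, the "article" class name, and the function names are all made up for illustration; the real page will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second article has no headline.
PAGE = """
<html><body>
  <div class="article">
    <span class="headline">First title</span>
    <div class="articletext">First body</div>
  </div>
  <div class="article">
    <div class="articletext">Second body</div>
  </div>
  <div class="article">
    <span class="headline">Third title</span>
    <div class="articletext">Third body</div>
  </div>
</body></html>
""".strip()


def pair_with_zip(root):
    """Page-wide queries: titles and contents are matched purely by position."""
    titles = [e.text for e in root.findall(".//span[@class='headline']")]
    contents = [e.text for e in root.findall(".//div[@class='articletext']")]
    return list(zip(titles, contents))


def pair_per_block(root):
    """One relative query per block: a missing field stays local to its block."""
    pairs = []
    for block in root.findall(".//div[@class='article']"):
        title = block.find("span[@class='headline']")
        content = block.find("div[@class='articletext']")
        pairs.append((
            title.text if title is not None else None,
            content.text if content is not None else None,
        ))
    return pairs


root = ET.fromstring(PAGE)
zipped = pair_with_zip(root)      # third title ends up next to the second body
per_block = pair_per_block(root)  # each title stays with its own body
```

With the zip approach a single missing headline shifts every later pair and silently drops the last content; the per-block loop keeps each title with its own content and leaves a None where the headline is absent.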

score 0 · Accepted Answer

You don't need HtmlXPathSelector. Scrapy already has XPath selectors built in. Try this:

blocks = response.xpath('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.xpath('span[@class="headline"]/text()').extract()[0]
    item['content'] = block.xpath('div[@class="articletext"]/text()').extract()[0]
    yield item
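One caveat with the [0] indexing: .extract() returns a list, so a block that lacks the element gives an empty list and [0] raises IndexError. A minimal sketch of a guard follows; first_or_default is a hypothetical helper, and the two lists only stand in for what .extract() might return.

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-ins for .extract() results from two different blocks (assumed data).
matched = ["Some headline"]
unmatched = []

title_a = first_or_default(matched)                # first match is used
title_b = first_or_default(unmatched, default="")  # no IndexError on empty list
```

Recent Scrapy versions also offer .get() on selector results, which returns None when nothing matches, so the helper may be unnecessary there.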
