python - Scrapy null output, but data is scraped

Question

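I'm scraping a website and trying to save the output in MongoDB. The code looks fine, but when I try a simple output (scrapy crawl IR -o items.json -t json), the file comes out blank... yet the spider's log shows that the data was scraped...

Here is my spider code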

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from teste.items import IngressoRapidoItem

class IngressoRapidoSpider(BaseSpider):
    name = "IR"
    allowed_domains = ["ingressorapido.com.br"]
    start_urls = (
        'http://www.ingressorapido.com.br/eventos.aspx?genero=55',
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = IngressoRapidoItem()
        item['banda'] = hxs.select('normalize-space(//a[contains(@href,"Evento")]/text())').extract()
        item['local'] = hxs.select('normalize-space(//td/span[contains(@style, "normal")]/text())').extract()
        items.append(item)
        return items

Can anyone guess why the output is empty even though the data was scraped? Thanks in advance.

score 0 · Accepted Answer

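After running the code posted above, I can confirm that data is scraped, but it is hard to say whether that data is really useful, because only one item is created, and it has a venue but no event name.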

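I modified the xpath code slightly and was able to extract an event name and venue for each event listed at http://www.ingressorapido.com.br/eventos.aspx?genero=55. I could then write the scraped data to a json file without any problems.

If you have any questions, or the xpath code does not return the data you need, let me know.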

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from teste.items import IngressoRapidoItem

class IngressoRapidoSpider(BaseSpider):
    name = "IR"
    allowed_domains = ["ingressorapido.com.br"]
    start_urls = (
        'http://www.ingressorapido.com.br/eventos.aspx?genero=55',
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select one cell per event, then query relative to each cell.
        events = hxs.select('//table[@id="ContentPlaceHolder1_dlEventos"]//table//td[2]')
        items = []
        for e in events:
            # Build a separate item for every event.
            item = IngressoRapidoItem()
            item['banda'] = e.select('normalize-space(.//a//text())').extract()
            item['local'] = e.select('normalize-space(.//span//text())').extract()
            items.append(item)
        return items
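
The spider above creates one item per event by selecting the event cells first and then querying relative to each cell. That pattern can be tried without Scrapy or network access. The sketch below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the SAMPLE markup is a made-up, simplified stand-in for the real page, and normalize() is a hypothetical helper that only approximates XPath normalize-space().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the event table on the real page.
SAMPLE = """
<table id="events">
  <tr><td><a href="Evento.aspx?id=1">  Band   One </a><span>Venue A</span></td></tr>
  <tr><td><a href="Evento.aspx?id=2">Band Two</a><span>  Venue
      B </span></td></tr>
</table>
"""


def normalize(text):
    """Trim and collapse whitespace, roughly like XPath normalize-space()."""
    return " ".join((text or "").split())


def extract_events(markup):
    """Return one dict per event cell found in the markup."""
    root = ET.fromstring(markup.strip())
    items = []
    # One iteration per cell, so every event gets its own item.
    for cell in root.findall(".//td"):
        # Paths starting with "." are resolved relative to the current cell.
        link = cell.find(".//a")
        venue = cell.find(".//span")
        items.append({
            "banda": normalize(link.text if link is not None else ""),
            "local": normalize(venue.text if venue is not None else ""),
        })
    return items


events = extract_events(SAMPLE)
for event in events:
    print(event)
```

Creating the item inside the loop is the point of the pattern: a single item built from a document-wide query can hold at most one name and one venue, however many events the page lists.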
