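Hi everyone, I've been working on learning Scrapy and am now on my first project. I wrote this code to try to scrape NFL player news from http://www.rotoworld.com/playernews/nfl/football/?rw=1. I tried to set up a loop to grab each container from the site, but when I run the code it doesn't scrape anything. The code runs without errors and even writes a CSV file when I ask for one; it just doesn't scrape what I thought I told it to scrape. Any help would be great! Thanks.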

import scrapy
from Roto_Player_News.items import NFLNews

class Roto_News_Spider2(scrapy.Spider):
    name="PlayerNews2"
    allowed_domains = ["rotoworld.com"]
    start_urls = ('http://www.rotoworld.com/playernews/nfl/football/',)

    def parse(self,response):

        containers= response.xpath('//*[@id="cp1_pnlNews"]/div/div[2]')

        def parse(self, response):

            for container in containers:
                def parse(self, response):           
                    item=NFLNews()
                    item['player']= response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="report"]/text()')
                    item['headline'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="report"]/p/text()').extract()
                    item['info'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="impact"]/text()').extract()
                    item['date'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="info"]/div[@class="date"]/text()').extract()
                    item['source'] = response.xpath('//div[@class="pb"][1]/div[@id="cp1_ctl00_rptBlurbs_floatingcontainer_0"]/div[@class="info"]/div[@class="source"]/a/text()').extract()

                    yield item
1 Answer

The XPaths you defined don't look right. Try this; it should fetch the content you want to scrape. You can copy and paste it as is.

import scrapy

class Roto_News_Spider2(scrapy.Spider):
    name = "PlayerNews2"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    def parse(self, response):
        # Each news blurb sits in its own <div class="pb">; query relative to it.
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first()
            # default="" avoids an AttributeError from .strip() when a blurb has no impact text
            impact = item.xpath(".//div[@class='impact']/text()").extract_first(default="").strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Report": report, "Date": date, "Impact": impact, "Source": source}
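
As for why your original spider ran cleanly but produced nothing: as far as I can tell, the outer parse only defines a nested parse function and never calls it, so the callback returns None and no item is ever yielded. Here is a minimal plain-Python sketch of that behaviour. It does not use Scrapy, and broken_parse, fixed_parse, and the sample rows are made-up names for illustration only.

```python
def broken_parse(rows):
    # Mirrors the shape of the spider in the question: the real work is
    # wrapped in a nested function that is defined but never called.
    def parse(rows):
        for row in rows:
            yield {"player": row}
    # Execution falls off the end here, so the function returns None
    # and nothing is ever yielded to the caller.


def fixed_parse(rows):
    # The loop and the yield sit directly in the callback body,
    # which makes this function a generator.
    for row in rows:
        yield {"player": row}


rows = ["player A", "player B"]  # placeholder data, not real site content

broken_result = broken_parse(rows)      # None: no items at all
fixed_result = list(fixed_parse(rows))  # one dict per row

print(broken_result)
print(fixed_result)
```

The same idea applies to the spider above: keep the for loop and the yield directly inside the single parse method, and use XPaths that start with ".//" on each container so every iteration reads its own blurb instead of the whole page.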
answered 2018-06-09T19:21:05.937