python - Can't follow links in Scrapy

Question

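I'm just getting started with Scrapy. I know how to get the content I want from a sports page (the names and teams of soccer players), but I need to follow links to find more teams. Every team page has a link to its players page. The site's links are structured like this:

Team page: http://esporte.uol.com.br/futebol/clubes/vitoria/
Players page: http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/

I have read some Scrapy tutorials, and my thinking is that for team pages I have to follow links without parsing anything, while for players pages I have to parse the players without following links. I don't know whether my idea is right and only my syntax is wrong. If my idea about following links is wrong, any help is welcome.

Here is my code: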

class MoneyballSpider(BaseSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br", "click.uol.com.br", "uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/', ), deny=(r'.*futebol/clubes/.*/jogadores/', )), follow = True),
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/jogadores/', )), callback='parse', follow = True),
        )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        jogadores = hxs.select('//div[@id="jogadores"]/div/ul/li')
        items = []
        for jogador in jogadores:
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            items.append(item)
            print item['nome'], item['time']
        return items

score 7 · Accepted Answer

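First of all, since you need to follow the extracted links, you need a CrawlSpider instead of a BaseSpider. Then, you need to define two rules: one for players, with a callback, and one for teams, without a callback. Also, you should start from a URL that contains the list of teams, for example http://esporte.uol.com.br/futebol. Here is a complete spider that returns players from different teams: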

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class JogadorItem(Item):
    nome = Field()
    time = Field()


class MoneyballSpider(CrawlSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br", "click.uol.com.br", "uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*?/jogadores/', )), callback='parse_players', follow=True),
             Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*', )), follow=True),)

    def parse_players(self, response):
        hxs = HtmlXPathSelector(response)
        jogadores = hxs.select('//div[@id="jogadores"]/div/ul/li')
        items = []
        for jogador in jogadores:
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            items.append(item)
            print item['nome'], item['time']
        return items

Excerpt from the output:

...
[u'Silva'] [u'Vila Nova-GO']
[u'Luizinho'] [u'Vila Nova-GO']
...
[u'Michel'] [u'Guarani']
[u'Wellyson'] [u'Guarani']
...
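
Note that the teams pattern in the second rule also matches players URLs, so the order of the two rules matters. As far as I understand CrawlSpider, when several rules match the same link, the first one defined is used. The sketch below only imitates that selection with plain `re` (no Scrapy involved, written for Python 3); `RULES` and `first_matching_rule` are hypothetical names made up for illustration, and only the two patterns are copied from the spider.

```python
import re

# Patterns copied from the spider's two rules, in the order they are declared.
# The names and the tuple layout are illustrative, not part of Scrapy.
RULES = [
    ("players", re.compile(r'.*futebol/clubes/.*?/jogadores/'), "parse_players"),
    ("teams", re.compile(r'.*futebol/clubes/.*'), None),  # no callback: follow only
]


def first_matching_rule(url):
    """Return (rule_name, callback_name) for the first rule whose pattern
    is found in url, or None if no rule matches."""
    for name, pattern, callback in RULES:
        if pattern.search(url):
            return name, callback
    return None


if __name__ == "__main__":
    for url in (
        "http://esporte.uol.com.br/futebol/clubes/vitoria/",
        "http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/",
        "http://esporte.uol.com.br/futebol",
    ):
        print(url, "->", first_matching_rule(url))
```

If the two rules were swapped, the players URL would hit the teams rule first in this sketch and never reach `parse_players`, which is why the more specific rule is listed first.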

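This is just a hint to get you going with the spider. You will need to tweak it further: choose an appropriate start URL for your needs, and so on.

Hope that helps.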
