I'm new to scrapy and I'm having trouble extracting data from a site. I believe I have a logic error, because my spider crawls the pages but never returns any scraped data. Any help would be appreciated!

rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(r'.*',),
            restrict_xpaths=('//div/div/div/span/a',) #This is the XPath for profiles links that direct to individual pages
        ),
        callback='parse_item',
        follow=True
    ),
    Rule(
        SgmlLinkExtractor(
            allow=(r'.*',),
            restrict_xpaths=('//*[contains(concat(" ", normalize-space(@class), " "), " on ")]',) #This is the XPath that cycles through pages
        ),
        callback='parse_item',
        follow=True
    ),
)

def parse_item(self, response):
    self.log('parse_item called for: %s' % response.url, level=log.INFO)
    hxs = HtmlXPathSelector(response)
    item = RealtorSpiderItem()
    item['name'] = hxs.select('//*[contains(concat(" ", normalize-space(@class), " "), " screenname ")]').extract()
    item['link'] = hxs.select('@href').extract()
    item['city'] = hxs.select('//*[contains(concat(" ", normalize-space(@class), " "), " locality ")]').extract()

    return item

1 Answer


In a crawl spider, the rules are used to find pages starting from start_urls, and parse_item() is triggered on every match.

I think you want to do this:

rules = (
    Rule(
        SgmlLinkExtractor(restrict_xpaths=('//div/div/div/span/a',)),
        callback='parse_item'
    ),
)

That way there is a single rule that finds the links inside the start_urls pages and runs parse_item() for every match.
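
If you still need the pagination from your original code, one option is to keep a second rule that only follows the pager links without a callback. Below is a minimal sketch of the whole spider under that assumption, using the same Scrapy 0.18-era API as the question; the item module path and start URL are placeholders, not from your project:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from realtor.items import RealtorSpiderItem  # hypothetical project module

class RealtorSpider(CrawlSpider):
    name = 'realtor'
    start_urls = ['http://www.example.com/agents']  # placeholder

    rules = (
        # Pager links: no callback, so this rule only follows them
        # (follow defaults to True when callback is None).
        Rule(
            SgmlLinkExtractor(
                restrict_xpaths=('//*[contains(concat(" ", normalize-space(@class), " "), " on ")]',)
            )
        ),
        # Profile links: each matched page is parsed by parse_item().
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('//div/div/div/span/a',)),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = RealtorSpiderItem()
        # text() extracts the readable string instead of the whole element
        item['name'] = hxs.select('//*[contains(concat(" ", normalize-space(@class), " "), " screenname ")]/text()').extract()
        item['link'] = response.url  # the URL of the profile page itself
        item['city'] = hxs.select('//*[contains(concat(" ", normalize-space(@class), " "), " locality ")]/text()').extract()
        return item

Since a Rule with no callback follows its extracted links by default, the pager pages get crawled but only the profile pages are handed to parse_item().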

See the CrawlSpider example in the Scrapy documentation.

Answered 2013-09-18T11:01:57.037