javascript - scrapy-splash: rendering more than the first page

Question

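I am trying to scrape a website, but I need to use Splash on every page because the content is generated dynamically. Right now only the first page is rendered; the content pages and the pagination pages are not.

Here is the code: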

import scrapy
from scrapy_splash import SplashRequest
import scrapy_splash

class ShutSpider(scrapy.Spider):
    name = 'Shut'

    def start_requests(self):
        # Only this first request goes through Splash
        yield SplashRequest(url='ROOTURL', callback=self.parse)

    def parse(self, response):
        # follow links to content pages
        content = response.xpath('//*[@id="iconQuesBar"]/a[4]/@href').extract()
        for href in content:
            yield response.follow(href.replace('?id=', ''), self.parse_QNA)
        if not content:
            return
        # follow pagination links
        for href in response.xpath('//*[@id="body-div"]/table/tbody/tr[2]/td[3]/center/form/span/a/@href').extract():
            yield response.follow(href, self.parse)

    def parse_QNA(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1[@class = "head"]/text()').extract()
        }

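I have played with it in every way I could think of, but nothing worked. The only solution I can think of now is to send the links to the content pages and the pagination pages through the rendering API, but I think that is really bad coding and there must be another way.

Thanks for your help.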

score 1 · Accepted Answer

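Instead of response.follow(), explicitly yield a new SplashRequest for each subsequent page. response.follow() creates a plain Scrapy request, which is not sent through Splash, so those pages are never rendered. In addition, you have to use response.urljoin() in this case, because SplashRequest needs an absolute URL. Here is the modified code: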

import scrapy
from scrapy_splash import SplashRequest
import scrapy_splash

class ShutSpider(scrapy.Spider):
    name = 'Shut'

    def start_requests(self):
        yield SplashRequest(url='ROOTURL', callback=self.parse)

    def parse(self, response):
        # follow links to content pages, rendering each one through Splash
        content = response.xpath('//*[@id="iconQuesBar"]/a[4]/@href').extract()
        for href in content:
            yield SplashRequest(response.urljoin(href.replace('?id=', '')), self.parse_QNA)
        if not content:
            return
        # follow pagination links, again through Splash
        for href in response.xpath('//*[@id="body-div"]/table/tbody/tr[2]/td[3]/center/form/span/a/@href').extract():
            yield SplashRequest(response.urljoin(href), self.parse)

    def parse_QNA(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1[@class = "head"]/text()').extract()
        }
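
To see why response.urljoin() is needed, here is a minimal sketch of the URL resolution step using only the standard library. response.urljoin() resolves an href against the page URL in roughly this way. The page URL and hrefs below are made-up placeholders, not values from the site in the question, and to_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only; not taken from the original site.
PAGE_URL = "http://example.com/questions/index.php?page=1"
HREFS = ["view.php?id=42", "/questions/index.php?page=2"]


def to_absolute(page_url, hrefs):
    """Resolve each (possibly relative) href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


absolute_urls = to_absolute(PAGE_URL, HREFS)
for url in absolute_urls:
    # Each of these is a full URL that could be passed to SplashRequest
    print(url)
```

response.follow() does this resolution for you, which is why the original spider could pass relative hrefs directly. Once you build SplashRequest objects yourself, the join has to be done explicitly.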
