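I am exploring Scrapy + Splash and have run into a problem: SplashRequest does not render the JavaScript and returns exactly the same response as scrapy.Request. The web page I want to scrape is this one. I need a few fields from the page for my course project.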

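Even after waiting for the JavaScript to render with 'wait': '30', I cannot get the final HTML. The result is identical to what scrapy.Request returns. The same code works for another website I tried, namely this one, so I believe the setup is fine.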

Here is the spider definition:

import scrapy
from bs4 import BeautifulSoup
from scrapy_splash import SplashRequest

from .. import IndeedItem


class IndeedSpider(scrapy.Spider):
    name = "indeed"

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation first
        super().__init__(*args, **kwargs)
        self.headers = {"Host": "www.naukri.com",
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"}

    def start_requests(self):
        yield SplashRequest(
            url="https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
            endpoint='render.html',
            headers=self.headers,
            args={
                'wait': 3,  # seconds Splash waits after the page has loaded
            },
        )

    def parse(self, response):
        # Name the parser explicitly so BeautifulSoup does not have to guess
        soup = BeautifulSoup(response.body, 'html.parser')
        it = IndeedItem()
        it['job_title'] = soup
        yield it

The relevant part of settings.py is:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}


SPLASH_URL = 'http://localhost:8050/'
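
One way to check whether the unrendered page comes from Splash itself rather than from the Scrapy side would be to request Splash's render.html endpoint directly (in a browser or with curl) and compare the result. The snippet below is only a minimal sketch of building that URL, assuming Splash is reachable at the SPLASH_URL above; build_render_url is a hypothetical helper name, not part of scrapy_splash, and the timeout value is just an illustration.

```python
from urllib.parse import urlencode, urljoin


def build_render_url(splash_url, target_url, wait=3, timeout=30):
    """Return a GET URL for Splash's render.html endpoint.

    build_render_url is a hypothetical helper for illustration only.
    `url`, `wait` and `timeout` are query arguments of render.html.
    """
    query = urlencode({"url": target_url, "wait": wait, "timeout": timeout})
    # urljoin copes with SPLASH_URL written with or without a trailing slash
    return urljoin(splash_url, "render.html") + "?" + query


if __name__ == "__main__":
    print(build_render_url(
        "http://localhost:8050/",
        "https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
    ))
```

If the HTML returned at the printed URL is also missing the job details, the problem would be in how Splash loads the page and not in the spider or the middleware settings.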

The output file is here.

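I don't know what to make of the output: it still has JavaScript embedded in it. Opening it in a browser shows that very little has been rendered (only the title). How can I get the rendered HTML of the site? Any help is much appreciated.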
