python - scrapy-splash does not return the JavaScript version of the page

Question

I am using scrapy-splash via Docker to access this page: https://finance.yahoo.com/quote/NFLX/options?p=NFLX

The script works (sort of!), but the returned page is marked with class="NoJs chrome featurephone" and does not contain all the fields I want to extract.
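To make that symptom easy to check from inside a spider callback, here is a minimal sketch using only the standard library. It assumes the class attribute quoted above sits on the root `<html>` element, which the post does not state explicitly; `RootClassParser` and `is_nojs_page` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class RootClassParser(HTMLParser):
    """Collect the class names found on the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.root_classes = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attribute values keep their case.
        if tag == "html" and self.root_classes is None:
            self.root_classes = (dict(attrs).get("class") or "").split()


def is_nojs_page(html_text):
    """Return True if the root <html> element carries the NoJs class.

    Assumption: the NoJs marker is a class on <html>, not on a nested element.
    """
    parser = RootClassParser()
    parser.feed(html_text)
    return "NoJs" in (parser.root_classes or [])
```

Calling `is_nojs_page(response.text)` in `parse` would then give a quick yes/no signal while experimenting with different request settings.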

These are my settings:

# -*- coding: utf-8 -*-

# Scrapy settings for yahoo_options project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yahoo_options'

SPIDER_MODULES = ['yahoo_options.spiders']
NEWSPIDER_MODULE = 'yahoo_options.spiders'

# ScrapySplash settings
SPLASH_URL = 'http://localhost:8050'

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# set custom dupfilter class for Splash
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/79.0.3945.123 Safari/537.36'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

This is my script:

import scrapy


class yahoo_optionsSpider(scrapy.Spider):
    name = 'yahoo_options'
    consent_url = ''  # value missing in the original post
    start_urls = ['https://finance.yahoo.com/quote/NFLX/options?p=NFLX']

    def parse(self, response):
        # get list of contract dates
        #options_dates = response.css('div.drop-down-selector > select > option::text').extract()

        print(response.text)

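Initially, I want to extract the option expiration dates from the drop-down menu, which is not available in the non-JS version of the page.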

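I tried changing USER_AGENT from the Scrapy default to the value shown in the settings above. I also tried switching IP addresses with a VPN, in an attempt to hide the fact that the requests come from a Docker instance.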

I have managed to achieve what I want using HTMLSession() from requests_html, and would like to know whether the same can be achieved with Scrapy + Splash.

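I suspect this is because the website identifies the request as coming from a bot. Any suggestions on how to resolve this would be greatly appreciated.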
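A way to narrow this down is to ask Splash for the page directly, bypassing Scrapy, and see which variant comes back. The sketch below only builds the URL for Splash's `render.html` HTTP endpoint with its `url` and `wait` parameters; the helper name `splash_render_url` and the 2-second wait are assumptions for illustration.

```python
from urllib.parse import urlencode

# Same value as SPLASH_URL in the settings above.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, splash_url=SPLASH_URL):
    """Build a render.html URL asking Splash to render page_url.

    wait is the number of seconds Splash waits after the page loads;
    2.0 is an assumed value for this sketch.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("https://finance.yahoo.com/quote/NFLX/options?p=NFLX"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) while the Docker container is running shows what Splash itself renders. If that output still has the NoJs class, the issue likely lies between Splash and the site; if it does not, the Scrapy side is probably not sending the request through Splash.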
