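I am using scrapy-splash via Docker to access this page: https://finance.yahoo.com/quote/NFLX/options?p=NFLX
The script works (sort of!), but the returned page is marked with class="NoJs chrome featurephone" and does not contain all the fields I want to extract.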
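To make the symptom easy to check programmatically, here is a minimal sketch that reports whether a page source carries the `NoJs` class token. The names `is_nojs_fallback` and `_ClassTokenCollector` are hypothetical helpers, not part of Scrapy or Splash, and the check simply assumes that the token `NoJs` appearing in any element's class attribute marks the non-JS version.

```python
from html.parser import HTMLParser


class _ClassTokenCollector(HTMLParser):
    """Collects every class token found on any start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.class_tokens = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        class_attr = dict(attrs).get("class") or ""
        self.class_tokens.update(class_attr.split())


def is_nojs_fallback(page_source: str) -> bool:
    """Return True if any element's class attribute contains the token 'NoJs'."""
    collector = _ClassTokenCollector()
    collector.feed(page_source)
    return "NoJs" in collector.class_tokens
```

Inside `parse`, calling `is_nojs_fallback(response.text)` would give a quick yes/no answer instead of reading the printed HTML by eye.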
These are my settings:
# -*- coding: utf-8 -*-
# Scrapy settings for yahoo_options project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'yahoo_options'
SPIDER_MODULES = ['yahoo_options.spiders']
NEWSPIDER_MODULE = 'yahoo_options.spiders'
# ScrapySplash settings
SPLASH_URL = 'http://localhost:8050'
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# set custom dupfilter class for Splash
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/79.0.3945.123 Safari/537.36'
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
This is my script:
import scrapy


class yahoo_optionsSpider(scrapy.Spider):
    name = 'yahoo_options'
    # consent_url =   # value omitted here
    start_urls = ['https://finance.yahoo.com/quote/NFLX/options?p=NFLX']

    def parse(self, response):
        # get list of contract dates
        #options_dates = response.css('div.drop-down-selector > select > option::text').extract()
        print(response.text)
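Initially, I want to extract the option expiration dates from the drop-down menu, which is not available in the non-JS version of the page.
I tried changing USER_AGENT from the Scrapy default to the value shown in the settings above. I also tried switching IP addresses with a VPN, in an attempt to hide the fact that the requests come from a Docker instance.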
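One way to narrow this down is to ask Splash for the page directly, bypassing Scrapy, and see which version comes back. Below is a minimal sketch using Splash's `render.html` HTTP endpoint with its `url`, `wait`, and `timeout` arguments. The helper names `build_render_url` and `fetch_rendered_html`, and the wait and timeout values, are assumptions chosen for illustration.

```python
import urllib.parse
import urllib.request

SPLASH_URL = "http://localhost:8050"  # same value as SPLASH_URL in the settings
TARGET_URL = "https://finance.yahoo.com/quote/NFLX/options?p=NFLX"


def build_render_url(splash_url: str, target_url: str,
                     wait: float = 2.0, timeout: float = 30.0) -> str:
    """Build a Splash render.html URL; wait/timeout defaults are illustrative guesses."""
    query = urllib.parse.urlencode(
        {"url": target_url, "wait": wait, "timeout": timeout}
    )
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_url: str, target_url: str, wait: float = 2.0) -> str:
    """Fetch the page as rendered by Splash (requires a running Splash instance)."""
    render_url = build_render_url(splash_url, target_url, wait=wait)
    with urllib.request.urlopen(render_url, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html(SPLASH_URL, TARGET_URL)` with the Docker container running returns the HTML as Splash sees it. If that output contains the drop-down, the difference probably lies in how the spider issues its request: scrapy-splash only routes a request through Splash when it is created as a `SplashRequest` (or carries a `splash` key in its meta), whereas `start_urls` produces plain requests. If the output is still the `NoJs` version, the site's handling of the Splash browser is the more likely cause.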
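I have managed to achieve what I want using HTMLSession() from requests_html, and I would like to know whether the same can be achieved with Scrapy + Splash.
I suspect this happens because the site identifies the request as coming from a bot. Any suggestions on how to resolve this would be much appreciated.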