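I run into a problem when I crawl a whole website using Splash to render each target page. Some pages randomly fail to render completely, so the response is missing information that should be present once rendering has finished. In other words, I get only part of the information from some render results, even though other render results contain all of it.
Here is my code: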
yield SplashRequest(url, self.splash_parse, args={"wait": 3}, endpoint="render.html")
settings:
SPLASH_URL = 'XXX'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
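
To make the symptom easier to pin down, here is a minimal sketch of how a partial render could be detected and retried with a longer wait. It is only an illustration under assumptions: the helper names `is_render_complete` and `next_wait`, the marker string `id="content"`, the meta key `splash_wait`, and the step and maximum wait values are all hypothetical and not part of my spider or of scrapy-splash.

```python
from typing import Iterable, Optional


def is_render_complete(html: str, required_markers: Iterable[str]) -> bool:
    """Return True only if every marker string appears in the rendered HTML.

    The markers are hypothetical: pick strings that exist only after the
    page's JavaScript has finished filling in the content.
    """
    return all(marker in html for marker in required_markers)


def next_wait(current_wait: float, step: float = 2.0,
              max_wait: float = 15.0) -> Optional[float]:
    """Return a longer wait for the next attempt, or None once max_wait is reached."""
    if current_wait >= max_wait:
        return None
    return min(current_wait + step, max_wait)


# Possible use inside the spider callback (assumes scrapy_splash is installed;
# "splash_wait" is a hypothetical meta key used to remember the last wait):
#
# def splash_parse(self, response):
#     wait = response.meta.get("splash_wait", 3)
#     if not is_render_complete(response.text, ['id="content"']):
#         retry_wait = next_wait(wait)
#         if retry_wait is not None:
#             yield SplashRequest(response.url, self.splash_parse,
#                                 args={"wait": retry_wait},
#                                 endpoint="render.html",
#                                 meta={"splash_wait": retry_wait},
#                                 dont_filter=True)
#         return
#     ...  # normal parsing of the fully rendered page
```

With these defaults, a page first rendered with a wait of 3 seconds would be retried at 5, 7, and so on up to 15 seconds before giving up. Whether a longer fixed wait is actually enough for the pages that fail is something I have not confirmed.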