I've put together a simple example of a spider that scrapes cookies. I need to use Selenium because I also need the cookies that are set by JS.
The list of URLs:
["https://archive.org",
"https://foxnews.com",
"https://spiegel.de",
"https://walmart.com",
"https://asus.comc",
"https://nintendo.com",
"https://americanexpress.com",]
When I check the cookies in the parse method, I get a strange result: for the first URL, archive.org, I only get the foxnews.com cookies; for foxnews.com, only the spiegel.de cookies; and so on.
Here is the spider:
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.remote.webdriver import WebDriver


class Spider(scrapy.Spider):
    name = 'simplespider'
    urls = ["https://archive.org",
            "https://foxnews.com",
            "https://spiegel.de",
            "https://walmart.com",
            "https://asus.comc",
            "https://nintendo.com",
            "https://americanexpress.com",]

    def start_requests(self):
        # Route every request through the scrapy-selenium middleware.
        for url in self.urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response, **kwargs) -> None:
        # The middleware exposes its WebDriver via response.meta.
        driver: WebDriver = response.meta['driver']
        cookies = driver.get_cookies()
        print(response.request.url)
        print(','.join([cookie['domain'] for cookie in cookies]))
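My guess is a timing issue: as far as I can tell, scrapy-selenium keeps a single shared WebDriver for all requests, so by the time parse runs for response N, the driver may already have navigated to URL N+1. A minimal way to check that theory (same spider, only the current_url print is added):

    def parse(self, response, **kwargs) -> None:
        driver: WebDriver = response.meta['driver']
        # If these two URLs differ, the shared driver has already navigated
        # to the next request before this callback ran.
        print('response URL:', response.request.url)
        print('driver URL:', driver.current_url)
        print(','.join(cookie['domain'] for cookie in driver.get_cookies()))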
I'm using the scrapy-selenium Python module.
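For completeness, the wiring follows the standard setup from the scrapy-selenium README; a sketch of the settings, assuming Firefox with geckodriver (the driver name, path lookup, and arguments depend on your local setup):

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # drop this to watch the browser

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}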
The output (excluding the logs):
https://archive.org
.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com
https://foxnews.com
.spiegel.de,.spiegel.de,.www.spiegel.de,.www.spiegel.de,.spiegel.de
https://spiegel.de
.www.walmart.com,www.walmart.com,.walmart.com,www.walmart.com,www.walmart.com,www.walmart.com,.walmart.com
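One experiment I considered: serializing requests so the single driver can't be re-navigated while a callback is still pending. This is plain Scrapy configuration; whether it actually removes the off-by-one is exactly what I'm unsure about:

class Spider(scrapy.Spider):
    name = 'simplespider'
    # Keep only one request in flight so the shared WebDriver
    # handles a single page at a time.
    custom_settings = {'CONCURRENT_REQUESTS': 1}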
Do you know why it behaves this way?