I put together a simple example of a spider that scrapes cookies. I need to use Selenium because I also need the cookies that are set by JavaScript.

The list of URLs:

               ["https://archive.org",
                "https://foxnews.com",
                "https://spiegel.de",
                "https://walmart.com",
                "https://asus.comc",
                "https://nintendo.com",
                "https://americanexpress.com",]

When I inspect the cookies in the parse method, I get a strange result: for the first URL, "archive.org", I only see "foxnews.com" cookies; for "foxnews.com" I only see "spiegel.de" cookies, and so on.

Here is the spider:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.remote.webdriver import WebDriver


class Spider(scrapy.Spider):
    name = 'simplespider'
    urls = ["https://archive.org",
            "https://foxnews.com",
            "https://spiegel.de",
            "https://walmart.com",
            "https://asus.comc",
            "https://nintendo.com",
            "https://americanexpress.com",]

    def start_requests(self):
        for url in self.urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # The WebDriver instance that scrapy-selenium used to fetch this response
        driver: WebDriver = response.meta['driver']
        cookies = driver.get_cookies()
        print(response.request.url)
        print(','.join([cookie['domain'] for cookie in cookies]))

I am using the scrapy-selenium Python module.
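
For context, my settings.py follows the standard scrapy-selenium setup from its README; the driver name and executable path below are placeholders for my local environment, so this is only a rough sketch of the configuration:

    # settings.py -- roughly what I have; driver name/path are placeholders
    from shutil import which

    SELENIUM_DRIVER_NAME = 'firefox'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
    SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser headless

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800
    }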

Output (logs omitted):

https://archive.org
.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com,.foxnews.com
https://foxnews.com
.spiegel.de,.spiegel.de,.www.spiegel.de,.www.spiegel.de,.spiegel.de
https://spiegel.de
.www.walmart.com,www.walmart.com,.walmart.com,www.walmart.com,www.walmart.com,www.walmart.com,.walmart.com

Does anyone know why it behaves this way?
