python - Scraping with scrapy_selenium: got multiple values for argument 'wait_time'

Question

I am trying to scrape a clothing website with scrapy_selenium, but I get the following error:

got multiple values for argument 'wait_time'

When I remove every argument from SeleniumRequest except url and callback=self.parse, I get this new error:

TypeError: __init__() missing 1 required positional argument: 'url'

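From searching, I found that the cause might be the chromedriver path, but the links I found were about selenium rather than scrapy_selenium, so I think the problem might be different?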

For example, I am running this script:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_selenium import SeleniumRequest

class EtsySpider(scrapy.Spider):
    name = 'Etsy_test'
    start_urls = ['https://www.etsy.com/search/clothing/womens-clothing?q=30s&explicit=1&locationQuery=2635167&ship_to=GB']
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url,
                wait_time = 3, 
                screenshot = True,
                callback = self.parse,
                dont_filter = True
            )
    def parse(self, response):
        stuff = response.xpath("//body/main[@role='main']/div/div/div/div/div/div/div/div/div/div[1]/ul[1]")
        for links in stuff:
            yield {
                'stuff':links
            }
process = CrawlerProcess(
    settings = {
        'FEED_URI':'clothes.jl',
        'FEED_FORMAT':'jsonlines'
    }
)

I also get the following output:

2022-01-04 13:01:22 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:56701/session/1773c683cbe7b50aa0c64eea666c4ea9 {}
2022-01-04 13:01:22 [urllib3.connectionpool] DEBUG: http://localhost:56701 "DELETE /session/1773c683cbe7b50aa0c64eea666c4ea9 HTTP/1.1" 200 14

My settings are as follows:

BOT_NAME = 'test_two'

SPIDER_MODULES = ['test_two.spiders']
NEWSPIDER_MODULE = 'test_two.spiders'

ROBOTSTXT_OBEY = True

from shutil import which
  
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['--headless']  
  
DOWNLOADER_MIDDLEWARES = {
     'scrapy_selenium.SeleniumMiddleware': 800
     }
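
Both errors are consistent with one cause, sketched below under an assumption: that SeleniumRequest.__init__ declares wait_time, wait_until, screenshot and script before *args and **kwargs, so a url passed positionally is bound to wait_time instead of reaching the underlying scrapy Request. The class FakeSeleniumRequest and the helper try_request are hypothetical stand-ins written for this illustration, not part of scrapy_selenium, and the URL is shortened from the one in the spider.

```python
class FakeSeleniumRequest:
    """Hypothetical stand-in mirroring the ASSUMED parameter order of
    scrapy_selenium.SeleniumRequest; it is not the real class."""

    def __init__(self, wait_time=None, wait_until=None, screenshot=False,
                 script=None, *args, **kwargs):
        # Stand-in for the base scrapy Request, which requires a url.
        if not args and "url" not in kwargs:
            raise TypeError(
                "__init__() missing 1 required positional argument: 'url'"
            )
        self.url = kwargs["url"] if "url" in kwargs else args[0]
        self.wait_time = wait_time
        self.screenshot = screenshot
        self.callback = kwargs.get("callback")


def try_request(*args, **kwargs):
    """Return the built request, or the TypeError message if building fails."""
    try:
        return FakeSeleniumRequest(*args, **kwargs)
    except TypeError as exc:
        return str(exc)


url = "https://www.etsy.com/search/clothing/womens-clothing"

# The positional url fills the wait_time slot, so wait_time=3 is a duplicate.
positional_with_wait = try_request(url, wait_time=3, screenshot=True)

# The positional url is still taken as wait_time, so the base class gets no url.
positional_only = try_request(url, callback=print)

# Passing url by keyword avoids both errors in this stand-in.
fixed = try_request(url=url, wait_time=3, screenshot=True, callback=print)

print(positional_with_wait)
print(positional_only)
print(fixed.url, fixed.wait_time)
```

If that assumption holds for the installed version of scrapy_selenium, the corresponding change in the spider would be to yield SeleniumRequest(url=url, wait_time=3, screenshot=True, callback=self.parse, dont_filter=True). The chromedriver path would then be unrelated to either TypeError.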
