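I'm trying to scrape a clothing website with scrapy_selenium, but I get the following error:
TypeError: __init__() got multiple values for argument 'wait_time'
When I remove every argument from SeleniumRequest except url and callback=self.parse, I get this new error instead:
TypeError: __init__() missing 1 required positional argument: 'url'
I've searched and it might be the chromedriver path, but the links I found were about selenium rather than scrapy_selenium, so I think the problem may be different?
For example, I'm running this script: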
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_selenium import SeleniumRequest


class EtsySpider(scrapy.Spider):
    name = 'Etsy_test'
    start_urls = ['https://www.etsy.com/search/clothing/womens-clothing?q=30s&explicit=1&locationQuery=2635167&ship_to=GB']

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url,
                wait_time=3,
                screenshot=True,
                callback=self.parse,
                dont_filter=True
            )

    def parse(self, response):
        stuff = response.xpath("//body/main[@role='main']/div/div/div/div/div/div/div/div/div/div[1]/ul[1]")
        for links in stuff:
            yield {
                'stuff': links
            }


process = CrawlerProcess(
    settings={
        'FEED_URI': 'clothes.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
I also get the following messages:
2022-01-04 13:01:22 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:56701/session/1773c683cbe7b50aa0c64eea666c4ea9 {}
2022-01-04 13:01:22 [urllib3.connectionpool] DEBUG: http://localhost:56701 "DELETE /session/1773c683cbe7b50aa0c64eea666c4ea9 HTTP/1.1" 200 14
My settings are as follows:
BOT_NAME = 'test_two'
SPIDER_MODULES = ['test_two.spiders']
NEWSPIDER_MODULE = 'test_two.spiders'
ROBOTSTXT_OBEY = True
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['--headless']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
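Both errors are consistent with url being passed positionally, so here is a minimal sketch of that suspicion. It assumes SeleniumRequest declares wait_time as its first parameter and forwards the remaining arguments to scrapy's Request. That is how I read the scrapy_selenium source, but I have not confirmed it for every version. FakeRequest and FakeSeleniumRequest are hypothetical stand-ins, not the real classes, and the example.com URL is a placeholder.

```python
# Hypothetical stand-ins that mimic the ASSUMED constructor signatures.
# They are not the real scrapy / scrapy_selenium classes.
class FakeRequest:
    def __init__(self, url, callback=None, dont_filter=False):
        self.url = url
        self.callback = callback
        self.dont_filter = dont_filter


class FakeSeleniumRequest(FakeRequest):
    # Assumption: wait_time is declared before url, so a positional
    # first argument is bound to wait_time rather than url.
    def __init__(self, wait_time=None, wait_until=None, screenshot=False,
                 script=None, *args, **kwargs):
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script
        super().__init__(*args, **kwargs)


def try_build(*args, **kwargs):
    """Return the built request, or the TypeError message if construction fails."""
    try:
        return FakeSeleniumRequest(*args, **kwargs)
    except TypeError as exc:
        return str(exc)


url = "https://example.com"  # placeholder URL

# Positional url plus wait_time=3: wait_time receives two values.
first_error = try_build(url, wait_time=3, screenshot=True)

# Positional url with only a callback: url is swallowed by wait_time,
# so the parent constructor never receives a url.
second_error = try_build(url, callback=print)

# Passing url by keyword avoids both problems in this sketch.
fixed = try_build(url=url, wait_time=3, screenshot=True,
                  callback=print, dont_filter=True)

print(first_error)
print(second_error)
print(fixed.url, fixed.wait_time)
```

If the assumption holds, writing url=url inside SeleniumRequest(...) in start_requests should be enough to get past both TypeErrors. It would not by itself tell me whether the chromedriver path is also a problem.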