I am practicing integrating Playwright with Scrapy. However, my scraper only returns one item. I am not sure whether my XPath is wrong, because I get the following output:
2022-01-04 21:41:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance>
{'items': 'Up to £26,000 per annum'}
I am trying to scrape salaries from a dynamic website. This is the script I have tried:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageCoroutine
from scrapy.item import Field
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader


class EtsyItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())


class EtsySpider(scrapy.Spider):
    name = 'job'
    start_urls = ['https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance']

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                    ]
                )
            )

    def parse(self, response):
        stuff = response.xpath("//div[@class='ResultsSectionContainer-sc-gdhf14-0 kteggz']")
        for items in stuff:
            loaders = ItemLoader(EtsyItem(), selector=items)
            loaders.add_xpath('items', '//dl[normalize-space()]//text()')
            yield loaders.load_item()


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    })
    process.crawl(EtsySpider)
    process.start()