python - scrapy-splash XPath selectors work in the shell but not in the spider

Question

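Problem:
I'm using scrapy-splash to scrape a YouTube video page. However, the XPaths do not seem to return anything except for the keywords element. (The XPaths were all copied directly from Chrome.)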

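What I have tried:
At first I thought this was because the page had not finished loading when parse was called, so I changed the wait argument of the SplashRequest, but that did not help. I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and verified that the XPaths/selectors all work on the downloaded copy. I am assuming that this HTML is exactly what Scrapy sees in parse, so I cannot understand why it does not work in the Scrapy script.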
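One way to test the assumption that the spider sees the same HTML as the Splash GUI is to save what parse actually receives and check it for the expected elements. The sketch below is only an illustration: `dump_and_check` is a hypothetical helper, and the marker strings and file names are assumptions chosen to match the selectors in this question.

```python
import tempfile
from pathlib import Path


def dump_and_check(html, markers, path="spider_response.html"):
    """Save the HTML the spider received and report which markers it contains."""
    Path(path).write_text(html, encoding="utf-8")
    return {marker: (marker in html) for marker in markers}


# Inside parse(), this could be called as (hypothetical usage):
#     found = dump_and_check(response.text, ["yt-formatted-string", 'name="keywords"'])
#     self.logger.info("markers found: %s", found)

# Stand-in HTML that has the keywords meta tag but no rendered page body.
sample = '<html><head><meta name="keywords" content="python"></head><body></body></html>'
with tempfile.TemporaryDirectory() as tmp:
    result = dump_and_check(
        sample,
        ["yt-formatted-string", 'name="keywords"'],
        path=Path(tmp) / "sample.html",
    )
print(result)
```

If the saved file lacks the elements that the GUI copy contains, the two responses are not the same document, and the selectors are not the first thing to debug.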

I also tried scrapy shell, and there everything works:
scrapy shell 'http://localhost:8050/render.html?url=https://www.youtube.com/watch?v=HOfTrhmIXIM&wait=2.0'

Response:

In [2]: response.xpath('//*[@id="container"]/h1/yt-formatted-string/text()').extract_first(default='')
Out[2]: 'Scraping, analyzing youtube channel data with python'
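The shell command above passes the target URL to Splash's render.html endpoint without percent-encoding it. That happens to parse as intended here, but a target URL with its own `&` parameters would be split incorrectly. A minimal sketch of building the same URL with the standard library follows; `splash_render_url` is a hypothetical helper name, and the base address is the local Splash instance mentioned in the question.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def splash_render_url(target_url, wait=2.0, splash_base="http://localhost:8050"):
    """Build a render.html URL with the target URL safely percent-encoded."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


render_url = splash_render_url("https://www.youtube.com/watch?v=HOfTrhmIXIM")
print(render_url)

# Decoding the query string recovers the original target URL and wait value.
print(parse_qs(urlsplit(render_url).query))
```

The resulting string can be passed to scrapy shell in quotes, in the same way as the command above.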

Code:
Here is my code:

import scrapy
from scrapy_splash import SplashRequest

# YoutubeVideoItem comes from the project's items module (import not shown in the question).


class videoSpider(scrapy.Spider):
    name = "videoSpider"
    start_urls = ["https://www.youtube.com/watch?v=HOfTrhmIXIM"]

    def start_requests(self):
        # Route every start URL through Splash and wait 5 seconds for rendering.
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 5})

    def parse(self, response):
        item = YoutubeVideoItem()
        #print(response.text)
        item['keywords'] = response.xpath('/html/head/meta[@name="keywords"]/@content').extract_first(default='')
        item['title'] = response.xpath('//*[@id="container"]/h1/yt-formatted-string').extract_first(default='')
        item['category'] = response.xpath('//*[@id="content"]/yt-formatted-string/a').extract_first(default='')
        item['visualizations'] = response.xpath('//*[@id="count"]/yt-view-count-renderer/span[1]').extract_first(default='')
        item['publication_data'] = response.xpath('//*[@id="date"]/yt-formatted-string').extract_first(default='')
        yield item

score -2 · Accepted Answer


Apparently, you are missing the allowed_domains list above start_urls.

I hope this helps.
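For context, allowed_domains is a spider class attribute that Scrapy's offsite filtering compares against the hostname of each request. The sketch below approximates that comparison with the standard library; it is not Scrapy's actual implementation, `is_onsite` is a hypothetical helper, and the value `["youtube.com"]` is an assumption. Whether adding the attribute changes the behavior described in the question is not established in this thread.

```python
from urllib.parse import urlsplit


def is_onsite(url, allowed_domains):
    """Approximate check: host equals an allowed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# In the spider, the attribute the answer refers to would sit above start_urls:
#     allowed_domains = ["youtube.com"]
youtube_ok = is_onsite("https://www.youtube.com/watch?v=HOfTrhmIXIM", ["youtube.com"])
splash_ok = is_onsite("http://localhost:8050/render.html", ["youtube.com"])
print(youtube_ok, splash_ok)
```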

Answered on 2020-05-15T19:20:26.050
