python - My first scrapy xpath selector

Question

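I am very new to this and have been trying to get my head around my first selector. Can somebody help me? I am trying to extract data from this page:

http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false

I want all the information under div class = Listing clearfix ShelfListing, but I can't seem to figure out how to format the response.xpath() call.

I have managed to launch the scrapy console, but no matter what I pass to response.xpath() I can't seem to select the right node. I know it works because when I type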

>>>response.xpath('//div[@class="container"]')

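I get a response. However, I don't know how to navigate to the Listing clearfix ShelfListing div. I am hoping that once I get this bit I can continue working my way through the spider.

PS: I wonder whether it is simply not possible to scan this site. Can the owners block spiders?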

score 4 · Accepted Answer

The contents of the div with the listings class (and id) are loaded asynchronously via an XHR request. In other words, the html code that Scrapy gets doesn't contain it:

$ scrapy shell http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
>>> response.xpath('//div[@id="listings"]')
[]
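One related detail may help explain the empty result: the shelf is identified by the part of the URL after `#`, and a URL fragment is handled by the page's JavaScript in the browser rather than being sent to the server. The following is a minimal sketch using only the Python standard library; it splits the URL from the question to show which part an HTTP client would actually request.

```python
from urllib.parse import urldefrag

URL = (
    "http://groceries.asda.com/asda-webstore/landing/home.shtml"
    "?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp"
    "#/shelf/1215337195041/1/so_false"
)

# urldefrag separates the fragment (everything after '#') from the rest.
requested, fragment = urldefrag(URL)

# The part that is sent to the server in the HTTP request.
print(requested)
# The part that stays in the browser and is interpreted by the page's script.
print(fragment)
```

So the shelf id never reaches the server in the initial request; the listing data has to come from a later request made by the page itself.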

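Using the browser developer tools, you can see a request with a bunch of GET parameters going to the http://groceries.asda.com/api/items/viewitemlist url.

One option would be to simulate that request and parse the resulting JSON.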

How to do that is actually part of another question.
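As a rough sketch of the idea only, the request could be rebuilt with the standard library along these lines. The endpoint is the one named above. The parameter name `example_param`, the JSON keys `items` and `name`, and the sample payload are all hypothetical placeholders; the real ones would have to be copied from the request and response shown in the browser's network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint seen in the browser developer tools.
API_URL = "http://groceries.asda.com/api/items/viewitemlist"


def build_url(params):
    """Append GET parameters (copied from the network tab) to the API url."""
    return API_URL + "?" + urlencode(params)


def fetch_json(params, timeout=10):
    """Send the GET request and decode the JSON body."""
    with urlopen(build_url(params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def item_names(payload, list_key="items", name_key="name"):
    """Collect item names from the decoded JSON.

    list_key and name_key are ASSUMED; inspect the real response
    to find the actual structure.
    """
    return [item[name_key] for item in payload.get(list_key, [])]


# Hypothetical parameter and payload, for illustration only.
params = {"example_param": "1215337195041"}
print(build_url(params))

sample = {"items": [{"name": "Example item A"}, {"name": "Example item B"}]}
print(item_names(sample))
```

Inside a spider, the same URL could instead be passed to `scrapy.Request`, with the body decoded in the callback via `json.loads(response.text)`.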

Here is one possible solution using the selenium package:

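from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false')

# Raises NoSuchElementException if the listings container is not in the page
div = driver.find_element(By.ID, 'listings')
# Print the text of every link with a title inside the listings container
for item in driver.find_elements(By.XPATH, '//div[@id="listings"]//a[@title]'):
    print(item.text.strip())

driver.close()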

Prints:

Kellogg's Coco Pops
Kelloggs Rice Krispies
Kellogg's Coco Pops Croco Copters
...
