python - Unable to retrieve information using Scrapy

Question

I want to retrieve mobile phone price information from http://www.bigcmobiles.in/categories/Mobile-Phones-Smart-Phones/cid-CU00091056.aspx. I used hxs.select('.//div[1]/div/div[1]/div/span/label[2]').extract(), which gave me an empty list.

Can you explain why?

score 1 · Accepted Answer

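The problem is that the products (mobile phones) on that site are loaded dynamically through an XHR request, so they are not part of the HTML that Scrapy downloads for the category page. You have to simulate that request in Scrapy to get the data you need.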
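One way to confirm this kind of diagnosis is to check whether the markup you downloaded contains the product elements at all. The following is a minimal sketch using only the standard library; the two HTML snippets are hypothetical stand-ins (not copied from the site), and the helper name count_elements_with_class is made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in the markup carry the given class."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets: an initial page with an empty placeholder, and a
# fragment like the one an XHR handler might return later.
static_page = '<div id="showcase"></div><script src="loader.js"></script>'
xhr_fragment = '<div class="bucket"><h4 class="mtb-title">Phone</h4></div>'

print(count_elements_with_class(static_page, "bucket"))
print(count_elements_with_class(xhr_fragment, "bucket"))
```

If the count for the page source is zero while the browser clearly shows products, the content is most likely being added by JavaScript after the page loads, and no XPath against the initial response will match it.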

Here is a spider for your case. Note that I took the URL from the Network tab of Chrome Developer Tools:

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BigCMobilesItem(Item):
    title = Field()
    price = Field()


class BigCMobilesSpider(BaseSpider):
    name = "bigcmobile_spider"
    allowed_domains = ["bigcmobiles.in"]
    start_urls = [
        "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput={%22PgControlId%22:1152173,%22IsConfigured%22:true,%22ConfigurationType%22:%22%22,%22CombiIds%22:%22%22,%22PageNo%22:1,%22DivClientId%22:%22ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase%22,%22SortingValues%22:%22%22,%22ShowViewType%22:%22%22,%22PropertyBag%22:null,%22IsRefineExsists%22:true,%22CID%22:%22CU00091056%22,%22CT%22:0,%22TabId%22:0}&_=1369724967084"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        mobiles = hxs.select("//div[@class='bucket']")
        print mobiles
        for mobile in mobiles:
            item = BigCMobilesItem()
            item['title'] = mobile.select('.//h4[@class="mtb-title"]/text()').extract()[0]
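            # Take the second matching text node as the price; fall back to
            # 'n/a' when fewer than two nodes are found.
            prices = mobile.select('.//span[@class="mtb-price"]/label[@class="mtb-ofr"]/text()').extract()
            item['price'] = prices[1].strip() if len(prices) > 1 else 'n/a'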
            yield item

Save it as spider.py, then run it with scrapy runspider spider.py -o output.json. In output.json you will then see:

{"price": "13,999", "title": "Samsung Galaxy S Advance i9070"}
{"price": "9,999", "title": "Micromax A110 Canvas 2"}
{"price": "25,990", "title": "LG Nexus 4 E960"}
{"price": "39,500", "title": "Samsung Galaxy S4 I9500 - Black"}
...

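These are the products from the first page. To get phones from the other pages, look at the XHR request the site is using: it has a PageNo parameter, which looks like what you need.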
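As a rough sketch of how the PageNo parameter could be varied, the helper below rebuilds the handler URL from the spider's start_urls for an arbitrary page number. The parameter values are copied from that URL; the function name build_page_url is hypothetical. It leaves out the trailing _=1369724967084 argument on the assumption that it is only a cache-busting timestamp, and whether the server accepts the request without it has not been verified.

```python
import json
from urllib.parse import quote

HANDLER_URL = "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx"

# Values copied from the XHR URL used in the spider above.
BASE_PARAMS = {
    "PgControlId": 1152173,
    "IsConfigured": True,
    "ConfigurationType": "",
    "CombiIds": "",
    "PageNo": 1,
    "DivClientId": "ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase",
    "SortingValues": "",
    "ShowViewType": "",
    "PropertyBag": None,
    "IsRefineExsists": True,
    "CID": "CU00091056",
    "CT": 0,
    "TabId": 0,
}


def build_page_url(page_no):
    """Return the handler URL for the given page number (hypothetical helper)."""
    # Overriding an existing key keeps its original position in the dict.
    params = dict(BASE_PARAMS, PageNo=page_no)
    # Compact JSON, matching the shape of the captured request.
    payload = json.dumps(params, separators=(",", ":"))
    # Percent-encode the double quotes but leave braces, colons and commas as-is.
    return HANDLER_URL + "?ProductShowcaseInput=" + quote(payload, safe="{}:,")


# Example: URLs for the first three pages.
for page in range(1, 4):
    print(build_page_url(page))
```

The resulting URLs could be placed in start_urls, or yielded as further requests from parse until a page comes back with no products.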

Hope that helps.
