
I want to crawl this website. I wrote a spider, but it only crawls the first page, i.e. the first 52 items.

I have tried this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from aqaq.items import aqaqItem
import os
import urlparse
import ast

a = []


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/",
    ]

    def parse(self, response):
        # ... Extract items in the page using extractors
        n = 3
        ct = 1

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="page"]')
        for site in sites:
            name = site.select('//div[@id="content"]/div[@class="l-pageWrapper"]/div[@class="l-main"]/div[@class="box box-bgcolor"]/section[@class="box-bd pan mtm"]/ul[@id="productsCatalog"]/li/a/@href').extract()
            print name
            print ct
            ct = ct + 1
            a.append(name)
        req = Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=" + str(n),
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, dont_filter=True)

        return req  # and your items

It shows the following output:

2013-10-31 09:22:42-0500 [jabong] DEBUG: Crawled (200) <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> (referer: http://www.jabong.com/women/clothing/womens-tops/)
2013-10-31 09:22:42-0500 [jabong] DEBUG: Filtered duplicate request: <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-10-31 09:22:42-0500 [jabong] INFO: Closing spider (finished)
2013-10-31 09:22:42-0500 [jabong] INFO: Dumping Scrapy stats:

When I set dont_filter=True, it never stops.


5 Answers


Yes, dont_filter has to be used here, because only the page GET parameter changes in the XHR request that is issued each time you scroll the page down to the bottom: http://www.jabong.com/women/clothing/womens-tops/?page=X

Now you need to figure out how to stop crawling. That is actually easy: just check when the next page has no products and raise a CloseSpider exception.

Here is a complete code example that works for me (it stops at page 234):

import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request


class Product(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/?page=1",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")

        if not products:
            raise CloseSpider("No more products!")

        for product in products:
            item = Product()
            item['brand'] = product.xpath(".//span[contains(@class, 'qa-brandName')]/text()").extract()[0].strip()
            item['title'] = product.xpath(".//span[contains(@class, 'qa-brandTitle')]/text()").extract()[0].strip()
            yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/", "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, 
                      dont_filter=True)
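
For reference, if a spider like this lives inside a Scrapy project, it can be run and its items exported with the usual command line (jabong is the spider name defined above; the output filename is just an example):

scrapy crawl jabong -o products.json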
answered 2015-04-18T23:18:04.747

You can try this code; it is slightly different from alecxe's code.

If there are no products, simply return from the function, which ultimately causes the spider to close. A simple solution.

import scrapy
from scrapy.spider import Spider
from scrapy.http import Request


class aqaqItem(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(Spider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = ["http://www.jabong.com/women/clothing/womens-tops/?page=1"]
    page_index = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")
        if products:
            for product in products:
                brand = product.xpath(
                    ".//span[contains(@class, 'qa-brandName')]/text()").extract()
                brand = brand[0].strip() if brand else 'N/A'
                title = product.xpath(
                    ".//span[contains(@class, 'qa-brandTitle')]/text()").extract()
                title = title[0].strip() if title else 'N/A'
                item = aqaqItem()
                item['brand'] = brand
                item['title'] = title
                yield item
        # if no products are available, simply return from parse,
        # which ultimately stops the spider
        else:
            return

        self.page_index += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%s" % self.page_index,
                      callback=self.parse)

Even though the spider yields more than 12.5k products, the output contains a lot of duplicate entries, so I have written an ITEM_PIPELINE that removes the duplicates and inserts the items into MongoDB.

The pipeline code is below:

from pymongo import MongoClient


class JabongPipeline(object):

    def __init__(self):
        # connect to the local MongoDB instance, database "jabong", collection "product"
        self.db = MongoClient().jabong.product

    def isunique(self, data):
        # the item is unique if no document with the same fields exists yet
        return self.db.find(data).count() == 0

    def process_item(self, item, spider):
        if self.isunique(dict(item)):
            self.db.insert(dict(item))
        return item
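
For the pipeline to actually run, it has to be enabled in the project settings. A minimal sketch, assuming the project module is named aqaq (the name used in the question's imports), the class above lives in aqaq/pipelines.py, and the priority value 300 is arbitrary:

# settings.py of the Scrapy project (module path and priority are assumptions)
ITEM_PIPELINES = {
    'aqaq.pipelines.JabongPipeline': 300,
}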

And here are the Scrapy log stats:

2015-04-19 10:00:58+0530 [jabong] INFO: Dumping Scrapy stats:
       {'downloader/request_bytes': 426231,
        'downloader/request_count': 474,
        'downloader/request_method_count/GET': 474,
        'downloader/response_bytes': 3954822,
        'downloader/response_count': 474,
        'downloader/response_status_count/200': 235,
        'downloader/response_status_count/301': 237,
        'downloader/response_status_count/302': 2,
        'finish_reason': 'finished',
        'finish_time': datetime.datetime(2015, 4, 19, 4, 30, 58, 710487),
        'item_scraped_count': 12100,
        'log_count/DEBUG': 12576,
        'log_count/INFO': 11,
        'request_depth_max': 234,
        'response_received_count': 235,
        'scheduler/dequeued': 474,
        'scheduler/dequeued/memory': 474,
        'scheduler/enqueued': 474,
        'scheduler/enqueued/memory': 474,
        'start_time': datetime.datetime(2015, 4, 19, 4, 26, 17, 867079)}
2015-04-19 10:00:58+0530 [jabong] INFO: Spider closed (finished)
answered 2015-04-19T04:49:11.770

If you open the developer console on that page, you will see that the page content is returned by a web request:

http://www.jabong.com/home-living/furniture/new-products/?page=1

This returns an HTML document containing all the items. So I would simply keep incrementing the value of page and parsing the result until the returned HTML equals the previously returned HTML.
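
A minimal sketch of that stopping strategy, assuming a Scrapy spider like the ones above; the previous response body is kept on the spider, and the crawl simply ends when two consecutive pages come back identical (item extraction is omitted, and the spider name is hypothetical):

import scrapy


class ComparePagesSpider(scrapy.Spider):
    # hypothetical spider built around the compare-with-previous-page idea
    name = "jabong_compare"
    start_urls = ["http://www.jabong.com/home-living/furniture/new-products/?page=1"]
    page = 1
    previous_body = None

    def parse(self, response):
        # stop as soon as the returned HTML equals the previously returned HTML
        if response.body == self.previous_body:
            return
        self.previous_body = response.body

        # ... extract and yield items here ...

        self.page += 1
        yield scrapy.Request(
            url="http://www.jabong.com/home-living/furniture/new-products/?page=%d" % self.page,
            callback=self.parse,
            dont_filter=True,
        )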

answered 2013-11-01T12:42:07.123

Using dont_filter and issuing a new request every time will indeed run forever, unless some error response comes back.

Do the infinite scrolling in a browser and look at what the response is when there are no more pages. Then handle that case in the spider by not issuing a new request.
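
A minimal sketch of that pattern, assuming (as the other answers do) that an empty product list is what signals the last page; the callback just returns instead of scheduling the next request:

def parse(self, response):
    products = response.xpath("//li[@data-url]")
    if not products:
        # no more pages: issue no new request, so the spider finishes on its own
        return
    # ... yield items, then yield the Request for the next page ...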

answered 2015-04-16T20:16:20.660
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.jabong.com/women/clothing/womens-tops/?page=3');
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0');
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$htmldata = curl_exec($curl_handle);
curl_close($curl_handle);

It works for me. Please try calling it via PHP cURL.
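
For comparison with the Python answers above, roughly the same request can be made with the requests library (the URL, User-Agent string, and X-Requested-With header are the ones from the PHP snippet):

import requests

# fetch page 3 the same way the cURL call above does
response = requests.get(
    "http://www.jabong.com/women/clothing/womens-tops/?page=3",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0",
        "X-Requested-With": "XMLHttpRequest",
    },
)
html = response.text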

answered 2014-11-18T10:10:14.180