python - Scraping a website with scrapy

Question

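I'm trying to scrape a website with scrapy, but I can't get all the products from it because the site uses endless scrolling...

I can only scrape data for 52 items, but there are 3824 items.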

hxs.select("//span[@class='itm-Catbrand strong']").extract()
hxs.select("//span[@class='itm-price ']").extract()
hxs.select("//span[@class='itm-title']").extract()

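If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the whole list of items, but it doesn't filter any further... How can I scrape all the items?

I have tried the following, but I get the same result. Where am I going wrong?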

def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body)
    for n in [2,3,4,5,6]:
        req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                      headers = {"Referer": "http://www.jabong.com/men/shoes/",
                                 "X-Requested-With": response.header['X-Requested-With']})
    return req

score 5 · Accepted Answer

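As you guessed, the website uses javascript to load more items as you scroll the page.

Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the javascript included in the page performs the following requests to load more items: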

GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...

The web server responds with documents of the following type:

<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
  <div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>

                                    <div class="itm-qlInsert tooltip-qlist  highlightStar"
                     onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog');
                                             return false;" >
                                              <div class="starHrMsg">
                         <span class="starHrMsgArrow">&nbsp;</span>
                         Save for later                         </div>
                                        </div>
                <a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html" 
                                    onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
                    <span class="lazyImage">
                        <span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
                            <noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
                        </span>                            
                    </span>

                                            <span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>                       
                                            <span class="itm-Catbrand strong">Phosphorus</span>
                    <span class="itm-title">
                                                                                Black Moccasins                        </span>

These documents contain more items.
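To illustrate what can be pulled out of such a fragment, here is a minimal sketch that reads the brand and title spans with the standard library's html.parser instead of scrapy's selectors. The names ItemParser, FIELDS and extract_items are hypothetical, and treating the li id as the SKU is an assumption based on the matching data-sku attribute in the fragment above.

```python
from html.parser import HTMLParser

# Class attribute values as they appear in the fragment above.
FIELDS = {"itm-Catbrand strong": "brand", "itm-title": "title"}


class ItemParser(HTMLParser):
    """Collects brand and title from each <li class="itm ..."> element."""

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per product
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = (attrs.get("class") or "").strip()
        if tag == "li" and "itm" in cls.split():
            # Assumption: the <li> id is the product SKU.
            self.items.append({"sku": attrs.get("id")})
        elif tag == "span" and cls in FIELDS and self.items:
            self._field = FIELDS[cls]
            self.items[-1][self._field] = ""

    def handle_data(self, data):
        if self._field:
            self.items[-1][self._field] += data

    def handle_endtag(self, tag):
        # The brand and title spans have no nested spans in the fragment,
        # so the first closing </span> ends the field.
        if tag == "span" and self._field:
            item = self.items[-1]
            item[self._field] = " ".join(item[self._field].split())
            self._field = None


def extract_items(html):
    parser = ItemParser()
    parser.feed(html)
    return parser.items
```

In a spider, the same fields would normally come from XPath expressions like the ones in the question; this version only shows that the AJAX fragments carry the same markup as the first page.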

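So, to get the full list of items, you have to return Request objects from your Spider's parse method (see the Spider class documentation) to tell scrapy that it should load more data: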

import re

from scrapy.http import Request


def parse(self, response):
    # ... Extract items in the page using extractors

    # n is the number of the next "page" to parse: take the number at the
    # end of response.url and add 1. The first page has no ?page= parameter,
    # so the next page after it is 2.
    match = re.search(r"[?&]page=(\d+)", response.url)
    n = int(match.group(1)) + 1 if match else 2

    # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
    # here because that's how the website detects whether the request was
    # made by javascript or directly by following a link.
    req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                  headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                           "X-Requested-With": "XMLHttpRequest"})
    return req  # and your items
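To estimate how many pages there are to request, one option is simple arithmetic on the counts from the question: 3824 items in total, 52 on the first page. The sketch below assumes every page returns 52 items like the first one, which nothing here confirms; page_urls is a hypothetical helper, not part of scrapy.

```python
import math

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def page_urls(total_items, items_per_page):
    """Return the ?page=N URLs needed after the first page."""
    # Assumption: every page holds the same number of items.
    last_page = math.ceil(total_items / items_per_page)
    # Page 1 is the plain listing URL, so the AJAX pages start at 2.
    return [f"{BASE_URL}?page={n}" for n in range(2, last_page + 1)]


urls = page_urls(3824, 52)
print(len(urls), urls[0], urls[-1])
```

If the page size turns out to vary, a safer stopping rule is to keep requesting the next page until a response contains no items.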

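Oh, and by the way (in case you want to test), you can't just load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website redirects you to http://www.website-your-are-crawling.com/men/shoes/ if the X-Requested-With header is different from XMLHttpRequest.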
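To look at a fragment outside the browser, the request can be reproduced with the two headers set by hand. This is a minimal sketch using the standard library's urllib.request rather than scrapy; build_ajax_request and fetch_page are hypothetical names, and whether the site still answers this way is an assumption.

```python
import urllib.request

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page):
    """Build a request that looks like the one the page's javascript sends."""
    return urllib.request.Request(
        f"{BASE_URL}?page={page}",
        headers={"Referer": BASE_URL, "X-Requested-With": "XMLHttpRequest"},
    )


def fetch_page(page):
    """Download one fragment. This performs a real network call."""
    with urllib.request.urlopen(build_ajax_request(page)) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Building the request does not touch the network, so it is safe to inspect.
req = build_ajax_request(2)
print(req.full_url)
print(req.header_items())
```

urllib changes the capitalization of header names when it stores them; HTTP header names are case-insensitive, so the server should treat them the same.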
