
I want to scrape http://www.3andena.com/. The site starts in Arabic and stores the language setting in a cookie. If you try to reach the English version directly through the URL (http://www.3andena.com/home.php?sl=en), it causes a problem and returns a server error.

So, I want to set the cookie value "store_language" to "en" and then start scraping the site using that cookie.

I am using a CrawlSpider with a couple of rules.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
  name = "andena"
  domain_name = "3andena.com"
  start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

  product_urls = []

  rules = (
     # The following rule is for pagination
     Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
     # The following rule is for product details
     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
     )

  def start_requests(self):
    yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

    for url in self.start_urls:
        yield Request(url, callback=self.parse_category)


  def parse_category(self, response):
    hxs = HtmlXPathSelector(response)

    self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

    for product in self.product_urls:
        yield Request(product, callback=self.parse_product)  


  def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = Product()

    '''
    some parsing
    '''

    items.append(item)
    return items

SPIDER = AndenaSpider()

Here is the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)

3 Answers


Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language':'en'}, callback=self.parse_category)

The Scrapy Request object accepts an optional cookies keyword argument; see the documentation here.

Answered 2014-06-08T13:02:01.480

As of Scrapy 0.24.6, this is how I do it:

from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request

Scrapy calls make_requests_from_url with the URLs in the start_urls spider attribute. What the code above does is let the default implementation create the request and then add a foo cookie with the value bar (or change the foo cookie to the value bar if, against all odds, the request produced by the default implementation already has one).

If you are wondering what happens with requests that are not created from start_urls: Scrapy's cookie middleware will remember the cookies set with the code above and set them on all future requests that share the same domain as the request to which you explicitly added your cookies.
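That per-domain persistence can be illustrated without Scrapy using the standard library's CookieJar, which follows similar matching rules (the domain and cookie name below mirror the question; this is a stand-alone sketch, not Scrapy's actual middleware):

```python
import urllib.request
from http.cookiejar import CookieJar, Cookie

# Put a cookie for www.3andena.com into the jar by hand, mimicking
# what a cookie middleware stores after the first tagged request.
jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name='store_language', value='en',
    port=None, port_specified=False,
    domain='www.3andena.com', domain_specified=False,
    domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False,
))

# A later request to the same domain automatically picks up the cookie.
req = urllib.request.Request('http://www.3andena.com/Kettles/')
jar.add_cookie_header(req)
print(req.get_header('Cookie'))  # store_language=en
```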

Answered 2015-06-02T15:08:18.653

This comes straight from the Scrapy documentation on Requests and Responses.

You need something like this:

request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'})
Answered 2012-05-19T17:13:43.073