python - 在 Scrapy 中初始化 CrawlSpider

Question

我在 Scrapy 中编写了一个蜘蛛，它基本上做得很好，并且完全按照它应该做的。问题是我需要对其进行一些小改动，并且我尝试了几种方法都没有成功（例如修改 InitSpider）。这是脚本现在应该执行的操作：

抓取起始网址http://www.example.de/index/search?method=simple
现在继续访问网址http://www.example.de/index/search?filter=homepage
使用规则中定义的模式从这里开始爬行

所以基本上所有需要改变的就是在两者之间调用一个 URL。我宁愿不用 BaseSpider 重写整个事情，所以我希望有人对如何实现这一点有一个想法:)

如果您需要任何其他信息，请告诉我。您可以在下面找到当前脚本。

#!/usr/bin/python
# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from example.items import ExampleItem
from scrapy.contrib.loader.processor import TakeFirst
import re
import urllib

take_first = TakeFirst()

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.de"]

    start_url = "http://www.example.de/index/search?method=simple"
    start_urls = [start_url]

    rules = (
        # http://www.example.de/index/search?page=2
        # http://www.example.de/index/search?page=1&tab=direct
        Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*&tab=direct', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

        # fetch all company entries
        companies = hxs.select("//ul[contains(@class, 'directresults')]/li[contains(@id, 'entry')]")
        items = []

        for company in companies:
            item = ExampleItem()
            item['name'] = take_first(company.select(".//span[@class='fn']/text()").extract())
            item['address'] = company.select(".//p[@class='data track']/text()").extract()
            item['website'] = take_first(company.select(".//p[@class='customurl track']/a/@href").extract())

            # we try to fetch the number directly from the page (only works for premium entries)
            item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/text()").extract())

            if not item['telephone']:
              # if we cannot fetch the number it has been encoded on the client and hidden in the rel=""
              item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/@rel").extract())

            items.append(item)
        return items

编辑

这是我对 InitSpider 的尝试：https ://gist.github.com/150b30eaa97e0518673a 我从这里得到了这个想法：Crawling with an authenticated session in Scrapy

如您所见，它仍然继承自 CrawlSpider，但我对核心 Scrapy 文件进行了一些更改（不是我最喜欢的方法）。我让 CrawlSpider 从 InitSpider 而不是 BaseSpider ( source ) 继承。

到目前为止，这是可行的，但是蜘蛛只是在第一页之后停止，而不是拾取所有其他页面。

此外，这种方法对我来说似乎完全没有必要:)

score 3 · Accepted Answer

好的，我自己找到了解决方案，它实际上比我最初想象的要简单得多:)

这是简化的脚本：

#!/usr/bin/python
# -*- coding: utf-8 -*-

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy import log
from scrapy.selector import HtmlXPathSelector
from example.items import ExampleItem
from scrapy.contrib.loader.processor import TakeFirst
import re
import urllib

take_first = TakeFirst()

class ExampleSpider(BaseSpider):
    name = "ExampleNew"
    allowed_domains = ["www.example.de"]

    start_page = "http://www.example.de/index/search?method=simple"
    direct_page = "http://www.example.de/index/search?page=1&tab=direct"
    filter_page = "http://www.example.de/index/search?filter=homepage"

    def start_requests(self):
        """This function is called before crawling starts."""
        return [Request(url=self.start_page, callback=self.request_direct_tab)]

    def request_direct_tab(self, response):
        return [Request(url=self.direct_page, callback=self.request_filter)]

    def request_filter(self, response):
        return [Request(url=self.filter_page, callback=self.parse_item)]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

        # fetch the items you need and yield them like this:
        # yield item

        # fetch the next pages to scrape
        for url in hxs.select("//div[@class='limiter']/a/@href").extract():
            absolute_url = "http://www.example.de" + url             
            yield Request(absolute_url, callback=self.parse_item)

如您所见，我现在正在使用 BaseSpider 并在最后自己生成新的请求。在开始的时候，我只是简单地浏览了在开始爬行之前需要提出的所有不同请求。

我希望这对某人有帮助:) 如果您有任何问题，我很乐意回答。

python - 在 Scrapy 中初始化 CrawlSpider

1 回答 1

Related

Reference