How do I add URLs to SgmlLinkExtractor? That is, how can I add arbitrary URLs for the callback to run on?

To elaborate, take dirbot as an example: https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/googledir.py

parse_category only visits pages whose URLs match the SgmlLinkExtractor: SgmlLinkExtractor(allow='directory.google.com/[A-Z][a-zA-Z_/]+$')
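
To see why an arbitrary URL never reaches parse_category, it may help to test the allow pattern on its own. The following is only a rough sketch using the standard re module, on the assumption that the extractor applies allow as a regex search against each extracted URL; the helper is_allowed and the sample URLs are hypothetical, not part of dirbot.

```python
import re

# The allow pattern from the question (as used in dirbot's googledir.py).
ALLOW = re.compile(r'directory.google.com/[A-Z][a-zA-Z_/]+$')

def is_allowed(url):
    """Hypothetical helper: True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None

# Hypothetical sample URLs, for illustration only.
candidates = [
    "http://directory.google.com/Top/Arts/",
    "http://directory.google.com/Top/Arts/page2.html",
    "http://www.example.com/anything",
]

# Only URLs matching the pattern would be handed to the rule's callback.
matched = [url for url in candidates if is_allowed(url)]
print(matched)
```

Anything outside the pattern is skipped by the rule, so such a URL has to be requested explicitly with its own callback.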

2 Answers

Use BaseSpider instead of CrawlSpider, then add the URLs in start_requests or in start_urls.

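from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"

    def start_requests(self):
        # Any URL can be requested here; the callback is set explicitly.
        return [Request("https://www.example.com",
            callback=self.parse)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...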
answered 2011-11-21T05:06:03.820
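from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class ThemenHubSpider(CrawlSpider):
    name = 'themenHub'
    allowed_domains = ['themen.t-online.de']
    start_urls = ["http://themen.t-online.de/themen-a-z/a"]
    # Links whose URL matches id_<digits> are passed to parse_news.
    rules = [Rule(SgmlLinkExtractor(allow=[r'id_\d+']), 'parse_news')]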
answered 2013-01-15T16:42:10.327