web-crawler - Can Scrapy's SgmlLinkExtractor use query parameters in allow?

Question

Can anyone explain to me why the code below doesn't find any links to follow?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class CoursesSpider(CrawlSpider):
    name = "courses"
    allowed_domains = ["test.com"]
    start_urls = [
        "http://golfpiste.com/kentat/?p=seuralista"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r"kentat/esittely/\?lang=fi", unique=True),
             callback='parse_item', follow=True),
    )

    # Must be a method of the spider so that callback='parse_item' resolves.
    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = Item()
        return item

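The problem is that allow=r"kentat/esittely/\?" does find links to follow, but as soon as I add any query parameter it finds no links at all, even though the kentat/esittely/?lang=fi link definitely exists.

So I'm wondering whether SgmlLinkExtractor can even take query parameters in "allow", or whether I'm doing something wrong?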

score 1 · Accepted Answer


The start URL and the link extractor rule were wrong. The rule should be "kentat/esittely.\?seura=".
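To see why the corrected rule could match where the original did not, here is a minimal sketch using only Python's re module. It assumes the allow pattern is applied as a regular-expression search against each extracted URL. The sample URL is hypothetical: the question does not show the page's real links, so the "seura=123&lang=fi" query string is an assumption made purely for illustration.

```python
import re

# Hypothetical link, assumed for illustration; the real page's links may differ.
url = "http://golfpiste.com/kentat/esittely/?seura=123&lang=fi"

patterns = {
    # Pattern from the question that did find links.
    "question_prefix": r"kentat/esittely/\?",
    # Pattern from the question that found nothing.
    "question_full": r"kentat/esittely/\?lang=fi",
    # Pattern suggested in the answer; the unescaped "." matches any character.
    "answer": r"kentat/esittely.\?seura=",
}

# Search each pattern anywhere in the URL and record whether it matched.
results = {name: bool(re.search(p, url)) for name, p in patterns.items()}

for name, matched in results.items():
    print(name, matched)
```

Under that assumption, the pattern ending in "\?lang=fi" fails only because "lang=fi" is not the first thing after the "?", not because query parameters are unsupported in allow.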

Answered 2013-08-23T07:44:39.547
