python - Scrapy SgmlLinkExtractor is ignoring allowed links

Question

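Please take a look at this spider example in the Scrapy documentation. The explanation is:

This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I copied the same spider exactly and replaced "example.com" with another initial URL.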

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from stb.items import StbItem

class StbSpider(CrawlSpider):
    domain_name = "stb"
    start_urls = ['http://www.stblaw.com/bios/MAlpuche.htm']

    rules = (Rule(SgmlLinkExtractor(allow=(r'/bios/.\w+\.htm', )), callback='parse', follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = StbItem()
        item['JD'] = hxs.select('//td[@class="bodycopysmall"]').re('\d\d\d\d\sJ.D.')
        return item

SPIDER = StbSpider()

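But my spider "stb" does not collect links from "/bios/" as it is supposed to. It fetches the initial URL, scrapes item['JD'], writes it to a file, and then quits.

Why is the SgmlLinkExtractor being ignored? The Rule itself is read, because syntax errors inside the Rule line are caught.

Is this a bug? Is there something wrong in my code? There are no errors apart from a bunch of unhandled errors that I see on every run.

It would be nice to know what I am doing wrong here. Thanks for any clues. Am I misunderstanding what SgmlLinkExtractor is supposed to do?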

score 11 · Accepted Answer

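The parse function is actually implemented and used by the CrawlSpider class itself, and you are unintentionally overriding it. If you rename your callback to something else, such as parse_item, and point the Rule's callback at that new name, the Rule should work.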
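To illustrate why the override matters, here is a minimal sketch in plain Python. It is a toy model and not Scrapy's actual implementation: ToyCrawlSpider, BrokenSpider, FixedSpider, the dict-based page, and the link /bios/JSmith.htm are all hypothetical names made up for this example. It only assumes, as described above, that the base class's own parse method is what applies the rules.

```python
import re


class ToyCrawlSpider:
    """Simplified stand-in for a rule-driven spider (NOT Scrapy's real code)."""

    # Each rule is a (url_regex, callback_name) pair; hypothetical format.
    rules = ()

    def parse(self, response):
        # The base class's parse is where the rules get applied:
        # every link matching a rule is handed to that rule's callback.
        results = []
        for pattern, callback_name in self.rules:
            for link in response["links"]:
                if re.search(pattern, link):
                    results.append(getattr(self, callback_name)(link))
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = ((r"/bios/.\w+\.htm", "parse"),)

    def parse(self, response):
        # Overrides the base parse, so the rule loop above never runs.
        return ["item from start page"]


class FixedSpider(ToyCrawlSpider):
    rules = ((r"/bios/.\w+\.htm", "parse_item"),)

    def parse_item(self, link):
        # A differently named callback leaves the base parse intact.
        return "item from " + link


# Hypothetical start page with three outgoing links.
page = {"links": ["/bios/MAlpuche.htm", "/bios/JSmith.htm", "/about.htm"]}

broken_result = BrokenSpider().parse(page)
fixed_result = FixedSpider().parse(page)

print(broken_result)
print(fixed_result)
```

In the spider from the question, the corresponding change would be to rename parse to parse_item and pass callback='parse_item' to the Rule.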
