regex - How do I use a regular expression in a Scrapy rule to extract URLs?

Question

I want to crawl Disney-related pages on the Bloomberg website. The URLs follow this pattern:

        "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"

So I wrote the following rule:

    rules = [
        Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
    ]

However, this rule does not work as intended: the crawl output contains pages unrelated to Disney. Please help me fix this rule.

score 3 · Accepted Answer


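/news/* matches /news followed by any number of / characters, because in a regular expression * repeats the preceding character rather than acting as a wildcard.

The correct regular expression is: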

/news/.*/disney
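As a quick check, the two patterns can be compared against the sample URL from the question using Python's re module. This is only a sketch: it assumes the allow patterns are applied with search semantics (matching anywhere in the URL), and the variable names are made up for illustration.

```python
import re

# Sample URL taken from the question.
url = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"

# "/*" means zero or more "/" characters, so the date segment cannot match.
original = re.compile(r"/news/*/disney*")
# ".*" means any run of characters, so the date segment is accepted.
corrected = re.compile(r"/news/.*/disney")

original_hit = original.search(url) is not None
corrected_hit = corrected.search(url) is not None

print(original_hit, corrected_hit)  # prints: False True
```

The original pattern fails because nothing in it can consume the 2013-07-08 part of the path.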

Answered on 2013-08-09T11:51:55.190

score 1 · Accepted Answer

You probably want the following regular expression:

 /news/[^/]+/disney.*

With the slashes escaped, it looks like this:

\/news\/[^\/]+\/disney.*

This way the part after /news/ matches only up to the next / and nothing beyond it.
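The difference between .* and [^/]+ shows up when "disney" sits deeper in the path. The sketch below compares the patterns with Python's re module; the second URL is a hypothetical example that does not come from the question, and the variable names are illustrative only.

```python
import re

# Sample URL taken from the question.
sample = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
# Hypothetical URL with an extra path segment before "disney".
nested = "http://bloomberg.com/news/2013-07-08/markets/disney-earnings"

loose = re.compile(r"/news/.*/disney")             # ".*" may span several segments
strict = re.compile(r"/news/[^/]+/disney.*")       # "[^/]+" stays within one segment
escaped = re.compile(r"\/news\/[^\/]+\/disney.*")  # same as strict, with "/" escaped

results = {
    name: [bool(pattern.search(u)) for u in (sample, nested)]
    for name, pattern in (("loose", loose), ("strict", strict), ("escaped", escaped))
}

print(results)
# prints: {'loose': [True, True], 'strict': [True, False], 'escaped': [True, False]}
```

In Python the / character has no special meaning in a pattern, so the escaped and unescaped forms behave the same; escaping matters only in languages that use / as a regex delimiter.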
