python - Scrapy - Limiting intermediate directories (Python)

Question

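Is there a way, in an SgmlLinkExtractor rule, to allow only a limited number of directories (say 3) between /static/ and /otherstuff/? In the examples below, EX1 should not be crawled (there are four directories between /static/ and /otherstuff/), but EX2 should.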

EX1: http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap
EX2: http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap

Assume that /static/ and /otherstuff/ always sit on either side of the directories I care about.

Thanks a ton for any help!

score 1 · Accepted Answer

You can use either a regular expression in the allow parameter or a test function in the process_value parameter. (See the documentation.)

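Both have advantages and drawbacks, depending on what the links in the page look like. With a regular expression, you test against the fully qualified URL (e.g. http://domain.com/foo/bar). With the process_value parameter, you get the raw value found in the page (e.g. /foo/bar or, worse, a relative link).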
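To make the difference concrete, here is a small sketch of how raw attribute values relate to the fully qualified URLs a regular expression would see. The page URL and href values are made up for illustration, and the resolution is done with the standard library's urljoin rather than with Scrapy itself.

```python
from urllib.parse import urljoin

# Hypothetical page the links were found on (an assumption for this sketch).
page_url = 'http://domain.com/foo/index.html'

# Raw values as they might appear in href attributes:
# a root-relative path and two relative links.
raw_values = ['/foo/bar', 'bar/baz', '../other']

# What the same links look like once resolved against the page URL.
absolute = [urljoin(page_url, value) for value in raw_values]

for raw, full in zip(raw_values, absolute):
    print(f'{raw!r:12} -> {full}')
```

A process_value function has to cope with every form in the first column, whereas an allow pattern only ever deals with the second.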

For example, the regular expression domain.com/(?:\w+/){1,3}\w+$ matches

domain.com/foo/bar
domain.com/foo/bar/foo
domain.com/foo/bar/foo/bar

but not

domain.com/foo/
domain.com/foo/bar/foo/bar/foo
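The lists above can be checked with Python's re module. This is a minimal sketch that assumes the pattern is applied with search semantics (it may match anywhere in the URL); the only change to the pattern is that the dot is escaped so it matches a literal ".".

```python
import re

# Pattern from the answer, with the dot escaped.
PATTERN = re.compile(r'domain\.com/(?:\w+/){1,3}\w+$')

urls = [
    'http://domain.com/foo/bar',
    'http://domain.com/foo/bar/foo',
    'http://domain.com/foo/bar/foo/bar',
    'http://domain.com/foo/',
    'http://domain.com/foo/bar/foo/bar/foo',
]

# True where the pattern is found somewhere in the URL.
results = {url: PATTERN.search(url) is not None for url in urls}

for url, matched in results.items():
    print('match   ' if matched else 'no match', url)
```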

If you use process_value, a function like this will work

def filter_path(value):
    # at least 2, at most 3 /'s
    if 1 < value.count('/') < 4:
        return value

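The function above assumes that the links in your HTML have href values such as /foo, /foo/bar/foo, and so on. A link is kept when the function returns its value and dropped when the function returns None.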

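In your specific case, the regular expression would be something like domain.com/static/(?:\w+/){1,3}otherstuff, which allows one to three intermediate directories (use {3} instead if you need exactly three), and the filter_path function could check value.startswith('/static/') and the suffix.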
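Both ideas can be sketched for the URLs in the question. This assumes "a limited number" means one to three intermediate directories, and that raw href values are root-relative paths; the names STATIC_RE and filter_static_path are hypothetical, not part of Scrapy.

```python
import re

# Assumed pattern: one to three directories between /static/ and otherstuff.
STATIC_RE = re.compile(r'domain\.com/static/(?:\w+/){1,3}otherstuff')

EX1 = 'http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap'
EX2 = 'http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap'


def filter_static_path(value, max_dirs=3):
    """Candidate process_value callback: return the value to keep it, None to drop it."""
    prefix, marker = '/static/', '/otherstuff/'
    if not value.startswith(prefix):
        return None
    head, found, _tail = value.partition(marker)
    if not found:
        return None
    # Directories sitting between the prefix and the marker.
    dirs = [part for part in head[len(prefix):].split('/') if part]
    return value if 1 <= len(dirs) <= max_dirs else None


regex_ex1 = STATIC_RE.search(EX1) is not None
regex_ex2 = STATIC_RE.search(EX2) is not None
path_ex1 = filter_static_path('/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap')
path_ex2 = filter_static_path('/static/d1/d2/otherstuff/otherstuff2/bunchacrap')
```

The regular expression would go in the allow parameter and the function in process_value; using both together is not required.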

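There is also a third option if you are using a Rule in a CrawlSpider: the process_links parameter lets you pass a function that processes the list of links. For example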

def url_allowed(url):
    # check for the pattern /static/dir/dir/dir/ etc
    return True

def process_links(links):
    return [l for l in links if url_allowed(l.url)]
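One possible way to fill in url_allowed is to split the URL path into segments and count what lies between static and otherstuff. This is a sketch under assumptions: the limit of three and the lower bound of one come from reading the question, and FakeLink is a hypothetical stand-in for Scrapy's link objects, used here only because the filter needs nothing but a url attribute.

```python
from collections import namedtuple
from urllib.parse import urlsplit

MAX_DIRS = 3  # assumed limit, taken from the question


def url_allowed(url, max_dirs=MAX_DIRS):
    """Return True if 1..max_dirs directories sit between static and otherstuff."""
    segments = [s for s in urlsplit(url).path.split('/') if s]
    try:
        start = segments.index('static')
        end = segments.index('otherstuff', start + 1)
    except ValueError:
        # One of the two marker directories is missing.
        return False
    return 1 <= end - start - 1 <= max_dirs


def process_links(links):
    return [link for link in links if url_allowed(link.url)]


# Hypothetical stand-in for a link object; only the url attribute is used.
FakeLink = namedtuple('FakeLink', 'url')

links = [
    FakeLink('http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap'),
    FakeLink('http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap'),
]
kept = process_links(links)
```

In a spider, the function would be passed as the process_links argument of a Rule, alongside whichever link extractor the rule already uses.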
