I'm crawling a site whose URLs can carry ?locale=en, locale=jp, and so on.
I only want to crawl pages whose URL does not specify a locale.
Currently I have this:
# More specific ones at the top please
# In general, deny all locale specified links
rules = (
    # Matches looks
    # http://lookbook.nu/look/4273137-Galla-Spectrum-Yellow
    Rule(SgmlLinkExtractor(allow=(r'/look/\d+'), deny=(r'\?locale=')), callback='parse_look'),
    # Matches all looks pages under a user overview
    Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/looks/?$'), deny=(r'\?locale=')),
         callback='parse_model_looks'),
    Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/looks\?page=\d+$'), deny=(r'\?locale=')),
         callback='parse_model_looks'),
    # Matches all user overview pages
    Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]*/?$'), deny=(r'\?locale=')),
         callback='parse_model_overview'),
)
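I'm repeating the deny everywhere.
There should be a better way, right?
I tried adding one general rule that denies every \?locale= link, but that didn't work.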
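For reference, here is a minimal sketch of the kind of de-duplication I have in mind: the deny pattern is defined once and the allow patterns live in a table. The names DENY_LOCALE, ROUTES and route are made up for illustration, and the function only mimics allow/deny matching with plain re.search so it can run without Scrapy. The [?&] variant of the pattern is an assumption that locale may also appear after another query parameter.

```python
import re

# Shared deny pattern, written once. [?&] also catches "locale=" when it
# is not the first query parameter (an assumption about the site's URLs).
DENY_LOCALE = r'[?&]locale='

# (allow pattern, callback name) pairs copied from the rules above,
# most specific first.
ROUTES = [
    (r'/look/\d+', 'parse_look'),
    (r'/user/\d+[^/]+/looks/?$', 'parse_model_looks'),
    (r'/user/\d+[^/]+/looks\?page=\d+$', 'parse_model_looks'),
    (r'/user/\d+[^/]*/?$', 'parse_model_overview'),
]


def route(url):
    """Return the callback name for url, or None if denied or unmatched."""
    # Deny wins over any allow pattern.
    if re.search(DENY_LOCALE, url):
        return None
    for allow, callback in ROUTES:
        if re.search(allow, url):
            return callback
    return None


# Untested idea for wiring the same table into the spider:
# rules = tuple(
#     Rule(SgmlLinkExtractor(allow=(allow,), deny=(DENY_LOCALE,)), callback=cb)
#     for allow, cb in ROUTES
# )
```

As far as I can tell, each Rule applies its own link extractor independently, which would explain why a separate deny-only rule did not stop the other rules from picking up locale links. If that is right, the deny has to reach every extractor, and a shared constant at least keeps it in one place.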