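I am using StormCrawler 1.11 with Elasticsearch 6.5.x and trying to apply the FastURLFilter. The first filter works fine, but for the remaining filters only the parent URL is crawled. Am I missing something in my configuration, or do I need to change anything so that all five URLs are crawled?
My seed URLs: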
https://www.abce.com/ghi/ seed=ghi
https://www.abce.com/jkl/ seed=jkl
https://www.abce.com/mno/ seed=mno
https://mnop.edu/ seed=mnop
https://jqkl.edu/ seed=jqkl
fasturlfilter.json
[
  {
    "scope": "domain:abce.com",
    "patterns": [
      "AllowPath /ghi/",
      "AllowPath /jkl/",
      "AllowPath /mno/",
      "DenyPath .+"
    ]
  },
  {
    "scope": "domain:mnop.edu",
    "patterns": [
      "AllowPath /",
      "DenyPath .+"
    ]
  },
  {
    "scope": "domain:jqkl.edu",
    "patterns": [
      "AllowPath /",
      "DenyPath .+"
    ]
  }
]
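To rule out a simple mismatch between the seeds and the scopes, a rough check like the one below can be run outside StormCrawler. It is only a sketch: it compares each seed's hostname against the `domain:` values from the file above by suffix, which is my assumption about what the scope is meant to cover, not StormCrawler's actual matching code. The names `SCOPES`, `SEEDS`, and `matching_scope` are made up for this example.

```python
from urllib.parse import urlparse

# Scope strings copied from fasturlfilter.json above.
SCOPES = ["domain:abce.com", "domain:mnop.edu", "domain:jqkl.edu"]

# Seed URLs copied from the seed list above (metadata omitted).
SEEDS = [
    "https://www.abce.com/ghi/",
    "https://www.abce.com/jkl/",
    "https://www.abce.com/mno/",
    "https://mnop.edu/",
    "https://jqkl.edu/",
]


def matching_scope(url, scopes=SCOPES):
    """Return the first 'domain:' scope that equals the URL's host or is a
    parent domain of it, or None if no scope matches.

    Assumption: a plain suffix comparison, NOT StormCrawler's own logic.
    """
    host = urlparse(url).hostname or ""
    for scope in scopes:
        kind, _, value = scope.partition(":")
        if kind == "domain" and (host == value or host.endswith("." + value)):
            return scope
    return None


if __name__ == "__main__":
    for seed in SEEDS:
        print(seed, "->", matching_scope(seed))
```

With this check every seed maps to one of the three scopes, so the scopes themselves do not look like the source of the mismatch.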