在 Nutch wiki 中,它建议使用以下内容:
bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
目的是什么
[-filter] [-normalize]
当 Nutch 有大量的过滤器和规范化配置文件时?
automaton-urlfilter.txt
domain-urlfilter.txt
regex-urlfilter.txt
suffix-urlfilter.txt
regex-normalize.xml
host-urlnormalizer.txt