solr - How do I tell Nutch to crawl through a url without storing it?

Question

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.

Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.

However, I do want Nutch to crawl all the other pages, looking for links to pages that match. I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).

What's the normal or least painful way to set Nutch->Solr up to work like this?

score 1 · Accepted Answer

It looks like the only way to do this is to write your own IndexingFilter plugin (or find someone else's to copy).

[Will add my example plugin code here once it is working properly]
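
A minimal sketch of the keep-or-drop decision such a plugin could make, not the finished plugin. It is plain Java with no Nutch dependency so it can be tried on its own. The class name `UrlIndexGate`, the host `wiki.example.com`, and the `DOCS` space pattern are all hypothetical placeholders for whatever regex describes the pages you want in Solr.

```java
import java.util.regex.Pattern;

/**
 * Sketch of the decision an indexing filter could make per page.
 * The pattern and host name below are hypothetical placeholders.
 */
final class UrlIndexGate {

    // Hypothetical rule: index only pages in the DOCS space of the wiki.
    private static final Pattern INDEXABLE =
            Pattern.compile("^https?://wiki\\.example\\.com/display/DOCS/.*$");

    private UrlIndexGate() {
        // Static helper only; not meant to be instantiated.
    }

    /** Returns true if the page at this URL should be sent to Solr. */
    static boolean shouldIndex(String url) {
        return url != null && INDEXABLE.matcher(url).matches();
    }
}
```

As far as I know, in Nutch 1.x an `IndexingFilter` can drop a page from the index by returning `null` from its `filter` method, so the plugin would return `null` whenever `shouldIndex` is false and return the document unchanged otherwise. Indexing filters only run at index time, so the non-matching pages should still be fetched and parsed and their outlinks still followed. Check this against the Nutch version you are running.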
