Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.

Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.

But I do want Nutch to crawl all the other pages, looking for links to pages that match. I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).

What's the normal or least painful way to set Nutch->Solr up to work like this?

1 Answer

It looks like the only way to do this is to write your own IndexFilter plugin (or find someone else's to copy).

[Will add my example plugin code here once it's working]
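
For anyone landing here in the meantime, here is a rough sketch of the decision logic such a plugin would probably need. It is not the plugin code promised above. `UrlIndexGate` and the example regex are hypothetical names made up for illustration, and the class deliberately avoids Nutch imports so it compiles on its own. The assumption is that you would call it from your implementation of Nutch's `IndexingFilter` interface and return `null` for documents you want skipped, which, as far as I know, is how Nutch 1.x drops a document from indexing while still having crawled it.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): decides whether a crawled URL
// should be passed on to Solr. Crawling itself is unaffected by this class.
final class UrlIndexGate {
    private final Pattern allowed;

    // regex: pattern that a URL must match in full to be indexed.
    UrlIndexGate(String regex) {
        this.allowed = Pattern.compile(regex);
    }

    // Returns true only when the entire URL matches the configured regex.
    boolean shouldIndex(String url) {
        return url != null && allowed.matcher(url).matches();
    }

    // Assumed usage inside an IndexingFilter implementation (Nutch 1.x):
    //   return gate.shouldIndex(url.toString()) ? doc : null;
}
```

Because the check lives in the indexing step rather than in the URL filters, Nutch should still fetch the non-matching pages and follow their outlinks; only the hand-off to Solr is suppressed.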

Reference:

Answered 2013-08-30T15:36:34.610