Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.
Now let's say I only want a subset of the Confluence pages (those matching a regex) to end up in the search index.
However, I still want Nutch to crawl all the other pages and follow their links to find pages that match. I just don't want Nutch to store those other pages (or at least I don't want Solr to return them in the results).
What's the normal or least painful way to set up Nutch and Solr to work like this?
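
To make the requirement concrete, here is a toy Python sketch of the behavior I'm after. It is not Nutch or Solr code: the regex, the example.com URLs, the `SITE` link graph, and the `crawl` helper are all made up for illustration. The point is only that every page gets visited and has its links followed, while only pages matching the regex get "indexed".

```python
import re
from collections import deque

# Hypothetical pattern for the pages that should end up in the index.
INDEX_PATTERN = re.compile(r"^https://confluence\.example\.com/display/DOCS/")

# Hypothetical link graph standing in for the Confluence instance.
ROOT = "https://confluence.example.com/"
HR_HOME = "https://confluence.example.com/display/HR/Home"
DOCS_GUIDE = "https://confluence.example.com/display/DOCS/Guide"
SITE = {
    ROOT: [HR_HOME],
    HR_HOME: [DOCS_GUIDE],  # a non-matching page that links to a matching one
    DOCS_GUIDE: [],
}

def crawl(start_url, get_links):
    """Visit every reachable page, but collect only URLs matching INDEX_PATTERN."""
    seen = {start_url}
    queue = deque([start_url])
    indexed = []
    while queue:
        url = queue.popleft()
        # Decide whether this page belongs in the index...
        if INDEX_PATTERN.search(url):
            indexed.append(url)
        # ...but follow its outlinks either way.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed, seen

indexed, seen = crawl(ROOT, lambda url: SITE.get(url, []))
```

Here the HR page is crawled (otherwise the DOCS page would never be discovered), but only the DOCS page ends up in `indexed`.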