Lucene - A specialized TokenStream/Analyzer for a given set of indexable keywords

Question

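I have the following situation: I have a set of documents to index, but I need to be selective about what I index.

Selection criterion: a document must contain at least one keyword from a given Set.

That part is easy: I can check whether any of those keywords occur in the document and index it only if they do. The tricky part (for me, anyway!) is that I want to index only those keywords. The keywords may be multi-word terms, or even regular expressions.

What the keywords actually are is irrelevant to this post, because I can abstract that away: I can generate the list of keywords that need to be indexed.

Is there an existing TokenStream, Analyzer, Filter combination that I can use? If not, could someone please point me in the right direction?

In case my question is not clear enough: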

HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene")); // needs java.util.Arrays; HashSet has no array constructor

I have a Content class that I use, say:

Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");

And suppose I have a method to get the matching keywords:

HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene"
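To make the setup concrete, here is a minimal sketch of what such a method could look like. The Content class below is a hypothetical stand-in (the real one is not shown in this post), and it assumes each keyword is a literal word or phrase that starts and ends with a word character, matched case-sensitively on word boundaries.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for the Content class mentioned above.
class Content {
    private final String text;

    Content(String text) {
        this.text = text;
    }

    // Returns the subset of keywords that occur in the text.
    // Assumption: keywords are literals, so they are quoted before being
    // compiled; \b keeps "Java" from matching inside "JavaScript".
    HashSet<String> getMatchingKeywords(Set<String> keywords) {
        HashSet<String> matches = new HashSet<String>();
        for (String keyword : keywords) {
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
            if (pattern.matcher(text).find()) {
                matches.add(keyword);
            }
        }
        return matches;
    }
}
```

For keywords that are themselves regular expressions, the Pattern.quote call would presumably be dropped and the keyword compiled as-is; in that case the matched text, rather than the pattern, may be what should get indexed.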

Only if there are matchingKeywords do I go ahead and index the document; so:

if(!matchingKeywords.isEmpty()) {
    // prepare document for indexing, and index.
    // But what should be my Analyzer and TokenStream?
}

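I would like to be able to create an Analyzer with a TokenStream that returns only these matching keywords, so that only these tokens get indexed.

Endnote: one possibility seems to be that, for each document, I add a variable number of fields, one per matching keyword. These fields would be indexed but not analyzed, using Field.Index.NOT_ANALYZED. However, it would be better if I could find a pre-existing Analyzer/TokenStream for this purpose instead of playing around with fields.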

score 0 · Accepted Answer

Following @femtoRgon's suggestion, I solved the problem above as follows.

As stated in the question, I have:

HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene")); // needs java.util.Arrays; HashSet has no array constructor

I have a Content class that I use, as follows:

Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");

And I have a method to get the matching keywords:

HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene" for this example `content`.

Only if there are matchingKeywords do I go ahead and index the document; so at indexing time I did:

if(!matchingKeywords.isEmpty()) {
    Document doc = new Document();
    for(String keyword: matchingKeywords) {   
        doc.add(new Field("keyword", keyword, Field.Store.YES, Field.Index.NOT_ANALYZED)); // one un-analyzed field per keyword
    }
    iwriter.addDocument(doc); // iwriter is the instance of IndexWriter
}

Then, at search time, I built the following boolean query:

BooleanQuery boolQuery = new BooleanQuery();

for(String queryKeyword: searchKeywords) {
    boolQuery.add(new TermQuery(new Term("keyword", queryKeyword)), BooleanClause.Occur.SHOULD);
}

ScoreDoc[] hits = isearcher.search(boolQuery, null, 1000).scoreDocs; // isearcher is the instance of IndexSearcher
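One consequence worth spelling out: a NOT_ANALYZED field is indexed as a single term exactly as given, and a TermQuery matches exact terms, so searching for "java" will not find a document indexed with "Java". The toy model below is not Lucene code, just a plain-Java sketch of those matching semantics. KeywordIndex, addDocument and searchAny are made-up names, and scoring is ignored entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory model (not Lucene): every keyword is stored as one whole,
// unmodified term, the way a NOT_ANALYZED field value would be.
class KeywordIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    private int nextDocId = 0;

    // Adds a document made of the given keywords and returns its id.
    int addDocument(Set<String> keywords) {
        int docId = nextDocId++;
        for (String keyword : keywords) {
            Set<Integer> docs = postings.get(keyword);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(keyword, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    // Returns ids of documents containing at least one of the terms
    // (exact, case-sensitive), like a query built only from SHOULD clauses.
    Set<Integer> searchAny(Iterable<String> terms) {
        Set<Integer> hits = new TreeSet<Integer>();
        for (String term : terms) {
            Set<Integer> docs = postings.get(term);
            if (docs != null) {
                hits.addAll(docs);
            }
        }
        return hits;
    }
}
```

If case-insensitive search is wanted, one option is to lowercase the keywords both when adding the fields and when building the TermQuery instances.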

I hope this answer helps anyone with a similar need.
