At some point your query-time analysis does not match your index-time analysis. Internally you are most likely using Lucene's StandardAnalyzer for query parsing, but not at indexing time, as this shows:
@SearchableMetaData(name="ordering_name", index=Index.NOT_ANALYZED)
The StandardTokenizer used by this analyzer treats the character / as a word boundary (just like whitespace), producing the tokens n and a. The token a is then removed by the StopFilter, because it is an English stop word. The following code demonstrates this (the input is "c/d e/f n/a"):
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
TokenStream tokenStream = analyzer.tokenStream("CONTENT", new StringReader("c/d e/f n/a"));
CharTermAttribute term = tokenStream.getAttribute(CharTermAttribute.class);
PositionIncrementAttribute position = tokenStream.getAttribute(PositionIncrementAttribute.class);
int pos = 0;
while (tokenStream.incrementToken()) {
    String termStr = term.toString();
    int incr = position.getPositionIncrement();
    if (incr == 0) {
        // position increment 0: token at the same position as the previous one
        System.out.print(" [" + termStr + "]");
    } else {
        pos += incr;
        System.out.println(" " + pos + ": [" + termStr + "]");
    }
}
You will see the following extracted tokens:
1: [c]
2: [d]
3: [e]
4: [f]
5: [n]
Note that the expected position 6, with token a, is missing. And as you can see, Lucene's QueryParser performs this tokenization too:
QueryParser parser = new QueryParser(Version.LUCENE_36, "content", new StandardAnalyzer(Version.LUCENE_36));
System.out.println(parser.parse("+n/a*"));
The output is:
+content:n
EDIT: A solution is to use WhitespaceAnalyzer and to set the field to ANALYZED. The following code is a proof of concept under plain Lucene:
IndexWriter writer = new IndexWriter(new RAMDirectory(), new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
Document doc = new Document();
doc.add(new Field("content", "Temp 0 New n/a", Store.YES, Index.ANALYZED));
writer.addDocument(doc);
writer.commit();

IndexReader reader = IndexReader.open(writer, true);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_36, "content", new WhitespaceAnalyzer(Version.LUCENE_36));
TopDocs docs = searcher.search(parser.parse("+n/a"), 10);
System.out.println(docs.totalHits);
writer.close();
The output is: 1
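To see the difference between the two analyzers without any Lucene dependency, here is a minimal plain-Java sketch. The two methods below are hypothetical stand-ins, not the Lucene classes: standardLike approximates StandardAnalyzer (split on non-alphanumeric characters, lowercase, drop stop words), and whitespaceLike approximates WhitespaceAnalyzer (split on whitespace only, so n/a survives as one token).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {

    // Rough approximation of StandardAnalyzer: '/' (like any
    // non-alphanumeric character) acts as a word boundary, and
    // English stop words such as "a" are removed afterwards.
    static List<String> standardLike(String text) {
        List<String> stopWords = Arrays.asList("a", "an", "the");
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            String lower = t.toLowerCase();
            if (!lower.isEmpty() && !stopWords.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    // Rough approximation of WhitespaceAnalyzer: split on
    // whitespace only, so "n/a" stays intact as a single token.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(standardLike("c/d e/f n/a"));   // [c, d, e, f, n] -- "a" dropped
        System.out.println(whitespaceLike("c/d e/f n/a")); // [c/d, e/f, n/a]
    }
}
```

This mirrors the token dump above: the standard-style split loses the a from n/a, while the whitespace-style split keeps n/a queryable as a whole.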