java - Why does Lucene not return results based on a whole-word match?

Question

I am using Lucene to match keywords against a list of words within an application. The whole process is automated, without any human intervention. The best-matched result (the one at the top, with the highest score) is picked from the results list returned by Lucene.

The code below demonstrates this functionality; the results are printed to the console.

Problem:

The problem is that Lucene searches for the keyword (the word to be searched) and returns, as the top result, a word that only partially matches the keyword. A result that matches the keyword more fully also exists, but it is not ranked in first position.

For example, suppose I have a Lucene RAM index that contains the entries 'Test' and 'Test Engineer'. If I search the index for 'AB4_Test Eng_AA0XY11', the results are:

Test
Test Engineer

'Test Engineer' is listed because Eng in 'AB4_Test Eng_AA0XY11' matched Engineer, but it does not get the top position. I want to optimize my solution so that 'Test Engineer' comes out on top, because it is the best match when the whole keyword is considered. Can anyone help me solve this problem?

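import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneTest {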

private static void search(Set<String> keywords) {

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    try {
        // 1. create the index
        Directory luceneIndex = buildLuceneIndex(analyzer);

        int hitsPerPage = 5;
        IndexReader reader = IndexReader.open(luceneIndex);

        for(String keyword : keywords) {

            // Create the query string: split the keyword on hyphen, underscore, comma, slash, (, ), {, }, colon, period and space.
            StringBuilder querystr = new StringBuilder(128);
            String [] splitName = keyword.split("[\\-_,/(){}:. ]");

            // Append every token to the query string, each followed by a plus sign.
            for (String token : splitName) {
                querystr.append(token + "+");
            }

            // 3. search
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);

            Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            System.out.println();
            System.out.println(keyword);
            System.out.println("----------------------");
            for (ScoreDoc scoreDoc : hits) {
                Document d = searcher.doc(scoreDoc.doc);
                System.out.println("Found " + d.get("id") +  " : " + d.get("name"));
            }

            // searcher can only be closed when there is no need to access the documents any more
            searcher.close();
        }

    }catch (Exception e) {
        e.printStackTrace();
    }
}

/**
 * Builds an in-memory index that holds the two sample names, "Test Engineer" and "Test".
 */
private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException{

    Map<Integer, String> map = new HashMap<Integer, String>();
    map.put(1, "Test Engineer");
    map.put(2, "Test");

    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);

    // 1. create the index
    IndexWriter w = new IndexWriter(index, config);
    for (Map.Entry<Integer, String> entry : map.entrySet()) {
        try {
            Document doc = new Document();
            doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("name", entry.getValue() , Field.Store.YES, Field.Index.ANALYZED));
            w.addDocument(doc);

        }catch (Exception e) {
            e.printStackTrace();
        }
    }

    w.close();

    return index;
}


public static void main(String[] args) {

    Set<String> list = new TreeSet<String>();

    list.add("AB4_Test Eng_AA0XY11");
    list.add("AB4_Test Engineer_AA0XY11");

    search(list);
}
}

score 0 · Accepted Answer

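If the two results (Test, Test Engineer) have the same ranking score, you will see them in the order in which they appear. You should try using a length filter and boosting terms; with those you may be able to come up with a solution.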
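Whether the two hits really tie depends, among other things, on field-length normalization. The sketch below is not Lucene code: it assumes the classic length norm of 1 / sqrt(number of terms) used by Lucene 3.x's default similarity, counts whitespace-separated terms as a stand-in for the analyzer, and ignores the lossy encoding Lucene applies when it stores norms. `LengthNormSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.Locale;

public class LengthNormSketch {

    // Counts whitespace-separated terms; a rough stand-in for what an analyzer would emit.
    public static int termCount(String fieldValue) {
        String trimmed = fieldValue.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Assumed classic length norm: 1 / sqrt(number of terms in the field).
    public static double lengthNorm(String fieldValue) {
        int n = termCount(fieldValue);
        return n == 0 ? 0.0 : 1.0 / Math.sqrt(n);
    }

    public static void main(String[] args) {
        // Print the norm for the two sample names from the question.
        for (String name : new String[] {"Test", "Test Engineer"}) {
            System.out.printf(Locale.ROOT, "%-14s norm=%.4f%n", name, lengthNorm(name));
        }
    }
}
```

Under these assumptions the one-term field 'Test' gets a larger norm than the two-term field 'Test Engineer', so when both documents match on the same single term the shorter one can come out ahead. A boost on the terms that matter would have to outweigh that difference.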

See also: What is the best Lucene setup for ranking exact matches as the highest

score 0 · Accepted Answer

You can look at the Lucene query syntax rules to see how to force a search for Test Engineer.

Basically, using a query such as

 AB4_Test AND Eng_AA0XY11

could work, although I am not sure. The page linked above is quite concise, and you will quickly be able to find a query that meets your needs.
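One way to produce such a query from the question's keywords might look like the sketch below. It only builds the query string; it reuses the separator pattern from the question's code, drops empty tokens, and joins the rest with the `AND` operator from the Lucene query syntax. `RequiredTermsQuery`, `tokenize` and `requireAll` are hypothetical names, and the sketch assumes the tokens contain no other query-syntax characters that would need escaping.

```java
import java.util.ArrayList;
import java.util.List;

public class RequiredTermsQuery {

    // Splits on the same separators as the question's code and drops empty tokens.
    public static List<String> tokenize(String keyword) {
        List<String> tokens = new ArrayList<String>();
        for (String token : keyword.split("[\\-_,/(){}:. ]")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Joins the tokens with AND so the query parser sees separate, required clauses.
    public static String requireAll(String keyword) {
        return String.join(" AND ", tokenize(keyword));
    }

    public static void main(String[] args) {
        // Prints: AB4 AND Test AND Eng AND AA0XY11
        System.out.println(requireAll("AB4_Test Eng_AA0XY11"));
    }
}
```

The resulting string can be passed to `QueryParser.parse` in place of the plus-joined string. With the sample index from the question, requiring every token would probably return no hits, because `AB4` and `AA0XY11` are not indexed, so requiring only some of the tokens may be a better fit.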
