java - lucene - Giving more weight to terms closer to the beginning of the title

Question

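I understand how to boost fields at index time or at query time. But how can I raise the score of a match when the term occurs closer to the beginning of the title?

Example: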

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

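I want the first document to score higher because "lucene" is closer to the beginning (ignoring term frequency for now).

I have seen how to use SpanQuery to specify proximity between terms, but I am not sure how to use information about a term's position within the field.

I am using Lucene 4.1 in Java.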

score 12 · Accepted Answer

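I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on positions, which are enabled by default when indexing in Lucene.

Let's test it on its own: you only need to supply your SpanTermQuery and the maximum position at which the term may be found (one in my example).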

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, this query matches only the first one, provided the title was analyzed with StandardAnalyzer.
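To make the role of the second argument concrete, here is a minimal plain-Java sketch of the check that a SpanFirstQuery over a single term performs: the term at position p forms the span [p, p + 1), and it matches only if that span ends at or before `end`. This is a simplified model, not Lucene's implementation. The class SpanFirstSketch, the helpers tokenize and matchesWithin, and the regex tokenizer (a crude stand-in for StandardAnalyzer) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class SpanFirstSketch {

    // Crude stand-in for an analyzer: lowercase, then split on anything
    // that is not a letter or digit, dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if `term` sits at some position p with p + 1 <= end,
    // i.e. the single-term span [p, p + 1) ends at or before `end`.
    static boolean matchesWithin(String title, String term, int end) {
        List<String> tokens = tokenize(title);
        int limit = Math.min(end, tokens.size());
        for (int p = 0; p < limit; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesWithin("Lucene: Homepage", "lucene", 1));
        System.out.println(matchesWithin("I have a question about lucene?", "lucene", 1));
        System.out.println(matchesWithin("I have a question about lucene?", "lucene", 6));
    }
}
```

With end = 1 only a term in the very first position qualifies; raising end widens the window, so a larger value would let the second title match as well.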

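Now we can combine the SpanFirstQuery above with a normal text query and let the span query affect only the score. You can do this easily with a BooleanQuery, adding the span query as a SHOULD clause, like this: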

Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));

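There are probably other ways to achieve the same result, perhaps with a CustomScoreQuery or custom scoring code, but this seems to me the simplest one.

The code I used to test it prints the following output (including scores). It runs the TermQuery alone first, then the SpanFirstQuery alone, and finally the combined BooleanQuery: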

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242

Here is the complete code:

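import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Wrapper class; the name is arbitrary.
public class SpanFirstTest {

    public static void main(String[] args) throws Exception {

        Directory directory = FSDirectory.open(new File("data"));

        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        // Plain term match: position in the title plays no role.
        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        // Matches only when the term is the first token of the field.
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        // The term is required; the span match only adds to the score.
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);

        indexReader.close();
    }

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));

        IndexWriter writer = new IndexWriter(directory, config);

        // Span queries need positions, so index them explicitly.
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        // Print every stored field of each hit together with its score.
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}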

score 0 · Answer

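Taken from the book "Lucene In Action 2":

"Lucene provides a built-in query, PayloadTermQuery, in the package org.apache.lucene.search.payloads. This query is just like SpanTermQuery in that it matches all documents containing the specified term and keeps track of the actual occurrences (spans) of the matches.

But it goes further by enabling you to contribute a scoring factor based on the payload that appears at each term's occurrence. To do this, you have to create your own Similarity class that defines the scorePayload method, like this:"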

public class BoostingSimilarity extends DefaultSimilarity {
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        ....
    }
}

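The "start" in the code above is simply the start position of the payload. A payload is associated with a term, so the start position also applies to the term (at least, that is my understanding).

By using the code above but ignoring the payload, you get access to the "start" position at scoring time, and you can then boost the score based on that start value.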

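For example: new score = original score * ( 1.0f / (1 + start-position) ). The 1 is added because positions start at 0, so dividing by the start position alone would fail for the first term.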
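The position-based scaling can be sketched in plain Java, independent of Lucene. This is only an illustration of the arithmetic, not code from the book or from Lucene: the class PositionBoostSketch and the method boostedScore are hypothetical names, and the sketch assumes zero-based positions, which is why it divides by 1 + start rather than by start alone.

```java
// Hypothetical illustration class; not part of Lucene.
class PositionBoostSketch {

    // Scales a score by 1 / (1 + start), so a term at position 0 keeps its
    // full score and later positions are damped progressively.
    static float boostedScore(float originalScore, int start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return originalScore * (1.0f / (1 + start));
    }

    public static void main(String[] args) {
        // Same original score, increasingly late positions.
        for (int start = 0; start <= 5; start++) {
            System.out.println("start=" + start + " -> " + boostedScore(1.0f, start));
        }
    }
}
```

This decay drops quickly (the second position already halves the score), so a gentler function of start may be preferable depending on how strongly early matches should be favored.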

I hope the approach above works. If you find any other working solution, please post it here.
