java - Lucene Scoring Function - bias towards shorter documents

Question

I want the Lucene scoring function to have no bias based on the length of the document. This is really a follow-up to the question "Calculate the score only based on the documents have more occurance of term in lucene".

I was wondering how Field.setOmitNorms(true) works? I see that there are two factors that make short documents get a high score:

I was wondering - if I wanted no bias towards shorter documents, is Field.setOmitNorms(true) enough?

score 1 · Accepted Answer

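With BM25Similarity you can reduce either of the following parameters to 0f:

@param b Controls to what degree document length normalizes tf values.

or

@param k1 Controls non-linear term frequency normalization (saturation).

Both parameters affect the SimWeight.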

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
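
To see why b = 0 removes the length bias, here is a minimal, library-free sketch of the textbook BM25 term-frequency component. It is not Lucene's implementation (Lucene's exact formula and its lossy encoding of field length vary by version); the class name Bm25LengthDemo and the sample numbers are made up for illustration.

```java
/** Illustrative only: textbook BM25 term-frequency component, not Lucene code. */
final class Bm25LengthDemo {

    /**
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
     * With b = 0 the length factor is exactly 1, so docLen has no effect.
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double k1 = 1.2;          // same k1 as in the answer's snippet
        double avgDocLen = 100.0; // hypothetical average field length
        double tf = 3.0;          // hypothetical term frequency

        // Compare a short and a long document with and without length normalization.
        for (double b : new double[] {0.75, 0.0}) {
            double shortDoc = tfComponent(tf, 20.0, avgDocLen, k1, b);
            double longDoc = tfComponent(tf, 500.0, avgDocLen, k1, b);
            System.out.printf("b=%.2f short=%.4f long=%.4f%n", b, shortDoc, longDoc);
        }
    }
}
```

In this formula, setting k1 = 0 also removes the length term, but it reduces the component to 1 for any tf > 0, so the number of occurrences stops mattering as well. If the goal is to keep rewarding more occurrences while ignoring length, b = 0 with a non-zero k1 looks like the better fit.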

score 0 · Accepted Answer

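With TF-IDF scoring, a shorter document is treated as more relevant.

You can use a custom scoring function in Lucene. It is easy to customize the scoring algorithm: subclass DefaultSimilarity and override the methods you want to customize.

Here is a code example that can help you implement it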
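
Independently of the linked example, the idea can be sketched without any Lucene dependency. The classes below are hypothetical stand-ins, not Lucene classes: ClassicLikeSimilarity mimics the classic length norm 1/sqrt(numTerms) and tf = sqrt(freq), and the score is simplified to tf * weight * lengthNorm. In real Lucene the method to override on DefaultSimilarity and its signature depend on the version, so check the Javadoc for the release you use.

```java
/** Hypothetical stand-in for a classic TF-IDF similarity; not a Lucene class. */
class ClassicLikeSimilarity {

    /** Classic term-frequency factor: square root of the raw frequency. */
    double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** Classic length norm: fields with fewer terms get a larger factor. */
    double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Simplified score; "weight" stands in for idf and any boosts. */
    double score(double freq, int numTerms, double weight) {
        return tf(freq) * weight * lengthNorm(numTerms);
    }
}

/** Overrides only the length norm so that field length no longer matters. */
class NoLengthNormSimilarity extends ClassicLikeSimilarity {
    @Override
    double lengthNorm(int numTerms) {
        return 1.0; // constant: short and long fields are treated alike
    }
}

final class LengthNormDemo {
    public static void main(String[] args) {
        ClassicLikeSimilarity classic = new ClassicLikeSimilarity();
        ClassicLikeSimilarity flat = new NoLengthNormSimilarity();

        // Same term frequency and weight, different field lengths.
        System.out.println("classic short: " + classic.score(4, 10, 1.5));
        System.out.println("classic long:  " + classic.score(4, 1000, 1.5));
        System.out.println("flat short:    " + flat.score(4, 10, 1.5));
        System.out.println("flat long:     " + flat.score(4, 1000, 1.5));
    }
}
```

Keep in mind that Lucene computes norms at index time, so a change to the length norm only takes effect for documents indexed (or reindexed) with the custom similarity in place.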
