3

I want the Lucene scoring function to have no bias based on the length of the document. This is a follow-up to the question "Calculate the score only based on the documents have more occurance of term in lucene".

How does Field.setOmitNorms(true) work? As far as I can see, two factors give short documents a higher score:

  1. the "boost", which favors shorter posts - read via doc.getBoost()
  2. the "lengthNorm" in the definition of norm(t,d)

Here is the documentation

If I want no bias towards shorter documents, is Field.setOmitNorms(true) enough?

4

2 Answers

1

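With BM25Similarity you can reduce the following to 0f:

@param b Controls to what degree document length normalizes tf values.

or

@param k1 Controls non-linear term frequency normalization (saturation).

Both parameters affect SimWeight.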

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
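
To see why setting b to 0f removes the length bias, here is a minimal sketch of the textbook BM25 term-frequency component in plain Java. It is not Lucene's implementation: the class name, method name, and the document lengths below are made up for illustration, and idf is left out because it does not depend on document length.

```java
// Sketch of the textbook BM25 term-frequency component (hypothetical helper, not a Lucene class).
final class Bm25LengthSketch {

    // freq:      how often the term occurs in the document
    // docLen:    length of the document (in terms)
    // avgDocLen: average document length in the collection
    // k1:        term-frequency saturation
    // b:         how strongly document length normalizes the tf value
    static double tfComponent(double freq, double docLen, double avgDocLen, double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        // Same term frequency, very different lengths (illustrative numbers only).
        System.out.println("b=0.75 short: " + tfComponent(3, 50, 200, k1, 0.75));
        System.out.println("b=0.75 long:  " + tfComponent(3, 800, 200, k1, 0.75));
        // With b = 0 the length factor collapses to 1, so both lines print the same value.
        System.out.println("b=0    short: " + tfComponent(3, 50, 200, k1, 0.0));
        System.out.println("b=0    long:  " + tfComponent(3, 800, 200, k1, 0.0));
    }
}
```

With b at 0 the document length drops out of the formula entirely, while k1 still controls how quickly repeated occurrences of a term saturate.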

More explanation can be found here: http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

answered 2017-06-05T18:04:56.537
0

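When you score with TF-IDF, shorter documents are treated as more relevant.

You can use a custom scoring function in Lucene, and the scoring algorithm is easy to customize: subclass DefaultSimilarity and override the methods you want to change.

Here is a code example that can help you implement it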
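
As a rough illustration of the subclass-and-override idea, the sketch below uses a hypothetical stand-in class rather than Lucene's DefaultSimilarity, so it runs without Lucene on the classpath. It assumes the classic TF-IDF shape (tf as the square root of the frequency, length norm as one over the square root of the term count) and omits idf and boosts. In real Lucene code the method to override and its signature depend on the Lucene version, so check the Javadoc for yours.

```java
// Hypothetical stand-in for a TF-IDF similarity; NOT a Lucene class.
class SketchTfIdfSimilarity {

    // Classic tf: square root of the term frequency.
    float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Classic length norm: shorter fields get a larger factor.
    float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Simplified score: idf and boosts are left out on purpose.
    float score(float freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }
}

// Overrides only the length norm so that field length no longer affects the score.
class LengthNeutralSimilarity extends SketchTfIdfSimilarity {

    @Override
    float lengthNorm(int numTerms) {
        return 1.0f; // constant: no preference for short or long documents
    }

    public static void main(String[] args) {
        SketchTfIdfSimilarity classic = new SketchTfIdfSimilarity();
        SketchTfIdfSimilarity neutral = new LengthNeutralSimilarity();
        // Same term frequency, different field lengths (illustrative numbers only).
        System.out.println("classic short: " + classic.score(4f, 10));
        System.out.println("classic long:  " + classic.score(4f, 1000));
        System.out.println("neutral short: " + neutral.score(4f, 10));
        System.out.println("neutral long:  " + neutral.score(4f, 1000));
    }
}
```

The subclass changes one method and inherits everything else, which is the same pattern the answer describes for a real Similarity subclass.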

answered 2014-07-18T07:35:43.090