With Lucene 3.x I had this:
new CustomScoreQuery(bigramQuery, new FieldScoreQuery("bigram-count", Type.BYTE)) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(IndexReader ir) {
    return new CustomScoreProvider(ir) {
      @Override
      public float customScore(int docnum, float bigramFreq, float docBigramCount)
          throws IOException {
        // ... calculate Dice's coefficient using bigramFreq and docBigramCount ...
        if (diceCoeff >= threshold) {
          String[] stems = ir.document(docnum).getValues("stems");
          // ... calculate document similarity score using stems ...
        }
      }
    };
  }
}
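For reference, the elided Dice step might look like the sketch below; `queryBigramCount` and `threshold` are hypothetical names standing in for values the surrounding code would have to supply:

    // Sketch only: Dice's coefficient D = 2*|shared| / (|query| + |doc|),
    // assuming bigramFreq approximates the number of shared bigrams.
    float diceCoeff = 2f * bigramFreq / (queryBigramCount + docBigramCount);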
This approach efficiently retrieves cached float values from a stored field, which I use to get a document's bigram count; it cannot retrieve strings, however, so I have to load the whole document to get what I need for calculating the document similarity score. It worked well enough until Lucene 4.1 switched to compressed stored fields.
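For context, here is a minimal sketch of what this implies at index time in Lucene 3.x; the field names match the query code above, the variable names are illustrative, and FieldScoreQuery requires the count to be indexed un-analyzed as plain numeric text:

    Document doc = new Document();
    // Numeric text, indexed but not analyzed, so FieldScoreQuery can decode it
    doc.add(new Field("bigram-count", Integer.toString(bigramCount),
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Stems stored so IndexReader.document(docnum).getValues("stems") can load them
    for (String stem : stems) {
      doc.add(new Field("stems", stem, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }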
The right way to take advantage of the enhancements in Lucene 4 turned out to involve DocValues, like this:
new CustomScoreQuery(bigramQuery) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext rc)
      throws IOException {
    final AtomicReader ir = rc.reader();
    final DocValues.Source
        bgCountSrc = ir.docValues("bigram-count").getSource(),
        stemSrc = ir.docValues("stems").getSource();
    return new CustomScoreProvider(rc) {
      @Override
      public float customScore(int docnum, float bgFreq, float... fScores) {
        final long bgCount = bgCountSrc.getInt(docnum);
        // ... calculate Dice's coefficient using bgFreq and bgCount ...
        if (diceCoeff >= threshold) {
          final String stems =
              stemSrc.getBytes(docnum, new BytesRef()).utf8ToString();
          // ... calculate document similarity score using stems ...
        }
      }
    };
  }
}
This resulted in a performance improvement from 16 ms (Lucene 3.x) down to 10 ms (Lucene 4.x).
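For completeness, the DocValues read above would be populated at index time roughly like this, using the Lucene 4.0 doc-values field classes (variable names are illustrative):

    Document doc = new Document();
    // Per-document numeric column backing bgCountSrc.getInt(docnum)
    doc.add(new PackedLongDocValuesField("bigram-count", bigramCount));
    // Per-document UTF-8 bytes backing stemSrc.getBytes(docnum, ...)
    doc.add(new StraightBytesDocValuesField("stems", new BytesRef(stemsText)));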