Let me preface this by saying that I'm not using Lucene in a very common way, so I'll first explain the setup so that my question makes sense. I'm using Lucene to search structured records. That is, each indexed document is a set of fields with short values from a given set. Each field is analysed and stored; the analysis usually produces no more than 3 and in most cases just 1 normalised token. As an example, imagine files, for each of which we store two fields: the path to the file and a user rating from 1 to 5. The path is tokenized with a PathHierarchyTokenizer and the rating is just stored as-is. So, if we have a document like

path: "/a/b/file.txt"
rating: 3

This document will have for its path field the tokens "/a", "/a/b" and "/a/b/file.txt", and for rating the token "3".

I wish to score this document against a query like "path:/a path:/a/b path:/a/b/different.txt rating:1" and get a value of 2 - the number of terms that match.
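To make the expected result concrete, here is a minimal sketch in plain Java that does not call Lucene at all. The class `ExpectedScore` and its helpers are hypothetical names used only for illustration; `pathPrefixes` merely imitates the path-prefix tokens listed above rather than running the real PathHierarchyTokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExpectedScore {

    // Hypothetical helper: imitates the path-prefix tokens described above
    // ("/a", "/a/b", "/a/b/file.txt"). It is not the Lucene tokenizer.
    public static List<String> pathPrefixes(String path) {
        List<String> tokens = new ArrayList<>();
        int idx = path.indexOf('/', 1);
        while (idx != -1) {
            tokens.add(path.substring(0, idx));
            idx = path.indexOf('/', idx + 1);
        }
        tokens.add(path);
        return tokens;
    }

    // All indexed terms of one record, written as "field:token".
    public static Set<String> indexedTerms(String path, int rating) {
        Set<String> terms = new HashSet<>();
        for (String prefix : pathPrefixes(path)) {
            terms.add("path:" + prefix);
        }
        terms.add("rating:" + rating);
        return terms;
    }

    // The score I am after: how many query terms occur in the document.
    public static int countMatches(Set<String> docTerms, List<String> queryTerms) {
        int matches = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> doc = indexedTerms("/a/b/file.txt", 3);
        List<String> query = Arrays.asList(
                "path:/a", "path:/a/b", "path:/a/b/different.txt", "rating:1");
        System.out.println(countMatches(doc, query)); // prints 2
    }
}
```

Only "path:/a" and "path:/a/b" are present in the document, so the count is 2; the differing file name and the rating do not match.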

My understanding and observation is that the score of a document depends on various term metrics, and with many documents of many fields each, I am most definitely not getting simple integer scores.

Is there some way to make Lucene score documents in the outlined fashion? The queries that are run against the index are not generated by users but are built by the system and have an optional filter attached, meaning they all have a fixed form: several TermQuerys joined in a BooleanQuery, with nothing like fuzzy textual search. Currently I don't have the option of replacing Lucene with something else, but suggestions for future development are welcome.

1 Answer

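I doubt there is anything ready to use, so most probably you will need to implement your own scorer and use it when searching. For complicated cases you may want to work at the query level, but for a simple case like yours it should be enough to override DefaultSimilarity, setting the tf factor to the raw frequency (the number of occurrences of the specified term in the document in question) and all other components to 1. Something like this: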

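// Package locations as in Lucene 3.x; later versions moved DefaultSimilarity
// to org.apache.lucene.search.similarities and changed computeNorm.
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

// Flattens every scoring factor except tf to 1.
public class MySimilarity extends DefaultSimilarity {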

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1;
    }

    @Override
    public float tf(float freq) {
        return freq;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }

}

(Note that tf() is the only method that returns something other than 1.)

Then just set this similarity on the IndexSearcher.
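As a rough check of why this yields integer scores, the sketch below models the general shape of the classic TF-IDF score, coord * queryNorm * sum(tf * idf^2 * boost * norm), in plain Java without Lucene. `FlatScoreModel` and its parameters are hypothetical names for illustration, and the boost is assumed to be 1 for every term.

```java
import java.util.HashMap;
import java.util.Map;

public class FlatScoreModel {

    // Models coord * queryNorm * sum over matching terms of (tf * idf^2 * norm).
    // Term boost is assumed to be 1. This is an illustration, not Lucene code.
    public static float score(Map<String, Integer> termFreqsInDoc, String[] queryTerms,
                              float coord, float queryNorm, float idf, float norm) {
        float sum = 0f;
        for (String term : queryTerms) {
            Integer freq = termFreqsInDoc.get(term);
            if (freq == null) {
                continue; // term not in the document, contributes nothing
            }
            float tf = freq; // raw frequency, as in the tf() override
            sum += tf * idf * idf * norm;
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("path:/a", 1);
        doc.put("path:/a/b", 1);
        doc.put("path:/a/b/file.txt", 1);
        doc.put("rating:3", 1);

        String[] query = {"path:/a", "path:/a/b", "path:/a/b/different.txt", "rating:1"};

        // With every factor fixed at 1, the score is the sum of raw frequencies.
        System.out.println(score(doc, query, 1f, 1f, 1f, 1f)); // prints 2.0

        // If a term occurred twice in a field, raw tf would count it twice.
        doc.put("path:/a", 2);
        System.out.println(score(doc, query, 1f, 1f, 1f, 1f)); // prints 3.0
    }
}
```

The second call shows the one caveat of using the raw frequency: the score equals the number of matching terms only as long as each term occurs at most once per field, which holds for the records described in the question. If that ever stops being true, capping tf at 1 would be one possible adjustment, though that goes beyond the code above.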

answered 2013-08-14T12:48:04.713