elasticsearch - 让 ElasticSearch 对结果中的总嵌套命中数（idf？）得分高于单次命中的 tf？

Question

如果我正在修改术语，请原谅我，但我在让 ES 以对我的应用程序有意义的方式对结果进行评分时遇到问题。

我正在用几个简单的字段索引数千个用户，以及嵌套在每个用户的索引中的可能数百个子对象（即Book --> Pages数据模型）。发送到索引的 JSON 如下所示：

user_id: 1
  full_name: First User
  username: firstymcfirsterton
  posts: 
   id: 2
    title: Puppies are Awesome
    tags:
     - dog house
     - dog supplies
     - dogs
     - doggies
     - hot dogs
     - dog lovers

user_id: 2
  full_name: Second User
  username: seconddude
  posts: 
   id: 3
    title: Dogs are the best
    tags:
     - dog supperiority
     - dog
   id: 4
    title: Why dogs eat?
    tags: 
     - dog diet
     - canines
   id: 5
    title: Who let the dogs out?
    tags:
     - dogs
     - terrible music

标签是类型“标签”，使用“关键字”分析器，并提升了 10。标题没有提升。

当我搜索“狗”时，第一个用户的分数高于第二个用户。我认为这与第一个用户的 tf-idf 更高有关。然而，在我的应用程序中，理想情况下，用户发布的帖子越多，该术语就会排在第一位。

我尝试按帖子数量排序，但如果用户有很多帖子，这会产生垃圾结果。基本上我想按唯一帖子点击的数量进行排序，这样拥有更多点击帖子的用户将上升到顶部。

我将如何去做这件事。有任何想法吗？

score 2 · Accepted Answer

首先，我同意@karmi 和@Zach 的观点，即通过匹配帖子弄清楚你的意思很重要。为简单起见，我假设匹配的帖子中某处有一个单词“dog”，我们没有使用关键字分析器来使标签匹配和提升更有趣。

如果我正确理解了您的问题，您希望根据相关帖子的数量对用户进行排序。这意味着您需要搜索帖子才能找到相关帖子，然后将此信息用于用户查询。只有单独索引帖子才有可能，这意味着帖子必须是子文档或嵌套字段。

假设帖子是子文档，我们可以像这样对您的数据进行原型制作：

curl -XPOST 'http://localhost:9200/test-idx' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
      "user" : {
        "_source" : { "enabled" : true },
        "properties" : {
            "full_name": { "type": "string" },
            "username": { "type": "string" }
        }
      },
      "post" : {
        "_parent" : {
          "type" : "user"
        },
        "properties" : {
            "title": { "type": "string"},
            "tags": { "type": "string", "boost": 10}
        }
      }
    }
}' && echo

curl -XPUT 'http://localhost:9200/test-idx/user/1' -d '{
    "full_name": "First User",
    "username": "firstymcfirsterton"
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/user/2' -d '{
    "full_name": "Second User",
    "username": "seconddude"
}'  && echo

#Posts of the first user
curl -XPUT 'http://localhost:9200/test-idx/post/1?parent=1' -d '{
    "title": "Puppies are Awesome",
    "tags": ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/2?parent=1' -d '{
    "title": "Cats are Awesome too",
    "tags": ["cat", "cat supplies", "cats"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/3?parent=1' -d '{
    "title": "One fine day with a woof and a purr",
    "tags": ["catdog", "cartoons"]
}'  && echo

#Posts of the second user
curl -XPUT 'http://localhost:9200/test-idx/post/4?parent=2' -d '{
    "title": "Dogs are the best",
    "tags": ["dog supperiority", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/5?parent=2' -d '{
    "title": "Why dogs eat?",
    "tags": ["dog diet", "canines"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/6?parent=2' -d '{
    "title": "Who let the dogs out?",
    "tags": ["dogs", "terrible music"]
}'  && echo

curl -XPOST 'http://localhost:9200/test-idx/_refresh' && echo

我们可以使用Top Children Query查询这些数据。（或者在嵌套字段的情况下，我们可以使用Nested Query获得类似的结果）

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "bool" : {
                "should": [
                    { "text" : { "title" : "dog" } },
                    { "text" : { "tags" : "dog" } }
                ]
            }
        },
        "score" : "sum"
    }
  }
}' && echo

此查询将首先返回第一个用户，因为来自匹配标签的巨大提升因子。因此，它可能看起来不像您想要的，但有几种简单的方法可以修复它。首先，我们可以降低标签字段的提升因子。10 确实是一个很大的因素，尤其是对于可以重复多次的领域。或者，我们可以修改查询以完全忽略子命中的分数，并使用最匹配的子文档的数量作为分数：

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "constant_score" : {
                "query" : {            
                    "bool" : {
                        "should": [
                            { "text" : { "title" : "dog" } },
                            { "text" : { "tags" : "dog" } }
                        ]
                    }
                }
            }
        },
        "score" : "sum"
    }
  }
}' && echo

elasticsearch - 让 ElasticSearch 对结果中的总嵌套命中数（idf？）得分高于单次命中的 tf？

1 回答 1

Related

Reference