elasticsearch - For an Elasticsearch match_phrase query, how can word order be taken into account without requiring all search terms to be present in the document?

Question

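Suppose my index has two documents:

"get my money"
"my money get here"

When I run a regular match query for "get my money", both documents match correctly, but they get the same score. However, I want word order to matter during scoring. In other words, I want "get my money" to score higher.

So I tried putting the match query in the must clause of a bool query and adding a match_phrase query (with the same query string). This seemed to score correctly until I searched with "how do i get my money". In that case the match_phrase query does not seem to match, and the hits are again returned with the same score.

How can I build my index/query so that word order is taken into account without requiring all searched terms to be present in the document?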

Index mapping with test data

PUT test-index
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}

POST test-index/_doc/
{
    "keyword" : "get my money"
}
POST test-index/_doc/
{
    "keyword" : "my money get here"
}

Query "how do i get my money" - does not work as desired

GET /test-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keyword": "how do i get my money"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "keyword": {
              "query": "how do i get my money"
            }
          }
        }
      ]
    }
  }
}

Result (both documents get the same score)

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 3.0,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "6Xy8wXIB3NtI_ttPGBoV",
        "_score" : 3.0,
        "_source" : {
          "keyword" : "get my money"
        }
      },
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "6ny8wXIB3NtI_ttPGBpV",
        "_score" : 3.0,
        "_source" : {
          "keyword" : "my money get here"
        }
      }
    ]
  }
}

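Edit 1

As @gibbs suggested, let's remove "similarity": "boolean". A simplified, more focused version of the question follows. We are still trying hard to find an answer to it.

Mapping with "similarity": "boolean" removed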

PUT test-index
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text"
      }
    }
  }
}

POST test-index/_doc/
{
    "keyword": "get my money"
}
POST test-index/_doc/
{
    "keyword": "my money get here"
}

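How can I make this query return results? Right now it returns none. When using match_phrase, is it possible to get results if not all search terms are present in the document?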

GET /test-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "keyword": {
              "query": "how do I get my money"
            }
          }
        }
      ]
    }
  }
}
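
For context on why this query returns nothing: with the default slop of 0, match_phrase only matches when every analyzed query term occurs in the field, in the same order and at adjacent positions. Raising slop relaxes the positions but still requires every term to be present. The following is a minimal Python sketch of that rule, not Elasticsearch code; `tokenize` and `phrase_matches` are hypothetical helpers, and the tokenizer is only a rough stand-in for the standard analyzer.

```python
def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption):
    # lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()


def phrase_matches(query, field_text):
    """True if all query tokens occur consecutively, in order, in the field."""
    q = tokenize(query)
    d = tokenize(field_text)
    if not q:
        return False
    # Slide a window of len(q) tokens over the field and compare.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


for doc in ["get my money", "my money get here"]:
    print(doc, "|",
          phrase_matches("get my money", doc),
          phrase_matches("how do I get my money", doc))
# get my money | True False
# my money get here | False False
```

Neither document contains "how", "do", or "I", so the longer phrase cannot match either of them, which is consistent with the empty result.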

Edit 2

In our use case we cannot use BM25 (TF/IDF), because it breaks our results.

POST test-index/_doc
{
  "keyword": "get my money, claim, distribution, getting started"
}

POST test-index/_doc 
{
  "keyword": "my money get here"
}

GET /test-index/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keyword": "how do I get my money"
          }
        }
      ]
    }
  }
}

Result

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6156533,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "JnxCw3IB3NtI_ttPBjQv",
        "_score" : 0.6156533,
        "_source" : {
          "keyword" : "my money get here"
        }
      },
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "x3xSw3IB3NtI_ttP1DUi",
        "_score" : 0.49206492,
        "_source" : {
          "keyword" : "get my money, claim, distribution, getting started"
        }
      }
    ]
  }
}

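In this case, because of TF/IDF, "my money get here" scores higher than expected. So we cannot let the score calculation depend on the number of matching documents, field length, and so on.

Sorry for the long question. So, back to my original question: how can I build my index/query so that word order is taken into account without requiring all searched terms to be present in the document?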
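
The field-length effect can be checked by hand. The sketch below is a simplified Python reimplementation of BM25 scoring for a match query, not Elasticsearch code. It assumes the index contains only the two documents above on a single shard, the default parameters k1 = 1.2 and b = 0.75, and a crude tokenizer in place of the standard analyzer; `tokenize` and `bm25_match_score` are hypothetical names.

```python
import math

K1, B = 1.2, 0.75  # default BM25 parameters (assumed unchanged)


def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption).
    return text.lower().replace(",", " ").split()


def bm25_match_score(query, doc, corpus):
    """Sum of per-term BM25 scores for the query terms found in doc."""
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average field length
    terms = tokenize(doc)
    score = 0.0
    for term in set(tokenize(query)):
        freq = terms.count(term)
        if freq == 0:
            continue  # missing terms contribute nothing
        df = sum(1 for d in docs if term in d)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Longer fields (len(terms) > avgdl) shrink this factor.
        tf_norm = freq / (freq + K1 * (1 - B + B * len(terms) / avgdl))
        score += (K1 + 1) * idf * tf_norm
    return score


corpus = [
    "get my money, claim, distribution, getting started",
    "my money get here",
]
for doc in corpus:
    score = bm25_match_score("how do I get my money", doc, corpus)
    print(f"{score:.4f}  {doc}")
# 0.4921  get my money, claim, distribution, getting started
# 0.6157  my money get here
```

Both documents match the same three terms with the same idf, so under these assumptions the whole gap between 0.6157 and 0.4921 comes from the length term: 4 tokens versus 7 against an average of 5.5.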

score 0 · Accepted Answer

The problem is caused by your similarity parameter.

A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost

Reference
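
To illustrate the quoted behaviour, here is a minimal Python sketch of how a match query would score under boolean similarity, assuming every term has the default boost of 1.0 and using a crude tokenizer in place of the standard analyzer. `tokenize` and `boolean_match_score` are hypothetical helpers, not Elasticsearch APIs.

```python
def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption).
    return text.lower().replace(",", " ").split()


def boolean_match_score(query, field_text, boost=1.0):
    """Each query term found in the field adds its boost; nothing else counts."""
    doc_terms = set(tokenize(field_text))
    return sum(boost for term in tokenize(query) if term in doc_terms)


for doc in ["get my money", "my money get here"]:
    print(boolean_match_score("how do i get my money", doc), doc)
# 3.0 get my money
# 3.0 my money get here
```

Both documents contain "get", "my", and "money", so each scores 3.0 regardless of word order or field length, which is the score reported in the question.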

You should use a different similarity, such as BM25, to get better scores.

I removed the similarity parameter from your mapping and indexed the same data, so the default similarity was used.

The scores are as follows.

{
    "took": 1069,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.5809142,
        "hits": [
            {
                "_index": "test-index",
                "_type": "_doc",
                "_id": "WpaHwnIBa8oXh9OgX4Hb",
                "_score": 0.5809142,
                "_source": {
                    "keyword": "get my money"
                }
            },
            {
                "_index": "test-index",
                "_type": "_doc",
                "_id": "W5aHwnIBa8oXh9OgeYG9",
                "_score": 0.5167642,
                "_source": {
                    "keyword": "my money get here"
                }
            }
        ]
    }
}
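
As one possible direction for the original requirement (word order should count, missing terms should not disqualify a document, and field length should not matter), adjacent word pairs can be scored in addition to single terms. Elasticsearch has a `shingle` token filter that can index such pairs, for example in a subfield. The Python sketch below is not taken from this answer and does not call Elasticsearch; it only simulates the counting under the assumption that every matching term or word pair adds a constant score, as boolean similarity would. `tokenize`, `shingles`, and `order_aware_score` are hypothetical helpers.

```python
def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption).
    return text.lower().replace(",", " ").split()


def shingles(tokens, size=2):
    """Set of adjacent word groups of the given size, e.g. {'get my', 'my money'}."""
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def order_aware_score(query, field_text, shingle_boost=1.0):
    """Count shared terms, plus shared word pairs weighted by shingle_boost."""
    q, d = tokenize(query), tokenize(field_text)
    term_hits = len(set(q) & set(d))
    shingle_hits = len(shingles(q) & shingles(d))
    return term_hits + shingle_boost * shingle_hits


query = "how do I get my money"
for doc in [
    "get my money",
    "my money get here",
    "get my money, claim, distribution, getting started",
]:
    print(order_aware_score(query, doc), doc)
# 5.0 get my money
# 4.0 my money get here
# 5.0 get my money, claim, distribution, getting started
```

In this simulation the documents that contain "get my" and "my money" in order score 5.0 whatever their length, while "my money get here" shares only "my money" and scores 4.0. Whether a real index behaves the same way depends on the analyzer and query used, which this sketch does not test.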
