elasticsearch - Elasticsearch highlighting on an ngram filter is weird if min_gram is set to 1

Question

So I have this index:

{
  "settings":{
    "index":{
      "number_of_replicas":0,
      "analysis":{
        "analyzer":{
          "default":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter":{
          "my_ngram":{
            "type":"nGram",
            "min_gram":2,
            "max_gram":20
          }
        }
      }
    }
  }
}

I am searching through the Tire gem, which sends this request:

{
   "query":{
      "query_string":{
         "query":"xyz",
         "default_operator":"AND"
      }
   },
   "sort":[
      {
         "count":"desc"
      }
   ],
   "filter":{
      "term":{
         "active":true,
         "_type":null
      }
   },
   "highlight":{
      "fields":{
         "name":{

         }
      },
      "pre_tags":[
         "<strong>"
      ],
      "post_tags":[
         "</strong>"
      ]
   }
}

I have two posts that should match, named "xyz post" and "xyz question". When I run this search, I get the highlighted fields back correctly:

<strong>xyz</strong> question
<strong>xyz</strong> post

Now here's the thing: as soon as I change min_gram to 1 in the index settings and reindex, the highlighted fields start coming back as

<strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
<strong>x</strong><strong>y</strong><strong>z</strong> questio<strong>xyz</strong>n

I simply cannot understand why.

score 12 · Accepted Answer

Short answer

You need to check your mapping to see whether you are using the fast-vector-highlighter. Even then, you still need to be very careful with your queries.
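As far as I know, Elasticsearch versions of this era pick the fast-vector-highlighter for a field only when that field is mapped with "term_vector":"with_positions_offsets". So "checking the mapping" comes down to looking for that setting. Below is a minimal sketch that does this on a mapping's "properties" section (for example, taken from the output of GET localhost:9200/myindex/_mapping). The helper name fields_with_term_vectors is made up for illustration, and field names are reported exactly as they are written in the mapping.

```python
def fields_with_term_vectors(properties):
    """Return the names (as written in the mapping) of fields that store
    term vectors with positions and offsets."""
    found = []
    for name, spec in properties.items():
        if spec.get("term_vector") == "with_positions_offsets":
            found.append(name)
        # multi_field sub-fields sit under "fields", object fields under "properties"
        for key in ("fields", "properties"):
            found.extend(fields_with_term_vectors(spec.get(key, {})))
    return found


# The "properties" section of the mapping used in the detailed answer.
product_properties = {
    "code": {
        "type": "multi_field",
        "fields": {
            "code": {"type": "string", "analyzer": "default", "store": "yes"},
            "code.ngram": {
                "type": "string",
                "analyzer": "default",
                "store": "yes",
                "term_vector": "with_positions_offsets",
            },
        },
    }
}

print(fields_with_term_vectors(product_properties))  # ['code.ngram']
```

Any field that does not show up in this list would be highlighted by the plain highlighter.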

Detailed answer

Assume a fresh ES 0.20.4 instance running on localhost.

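Building on your example, let's add an explicit mapping. Note that I set up two different analyses for the code field. The only difference between them is "term_vector":"with_positions_offsets".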

curl -X PUT localhost:9200/myindex -d '
{
  "settings" : {
    "index":{
      "number_of_replicas":0,
      "number_of_shards":1,
      "analysis":{
        "analyzer":{
          "default":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter":{
          "my_ngram":{
            "type":"nGram",
            "min_gram":1,
            "max_gram":20
          }
        }
      }
    }
  },
  "mappings" : {
    "product" : {
      "properties" : {
        "code" : {
          "type" : "multi_field",
          "fields" : {
            "code" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes"
            },
            "code.ngram" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes",
              "term_vector":"with_positions_offsets"
            }
          }
        }
      }
    }
  }
}'
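To get a feeling for what this analyzer produces, here is a rough Python model of the keyword tokenizer followed by the lowercase and my_ngram filters. It is a sketch, not Lucene's code: it assumes the ngram filter emits all grams of one length before moving on to the next length, and that each gram carries offsets into the original string, which is how I understand the Lucene version behind ES 0.20 to behave. The function name ngram_filter is made up for illustration.

```python
def ngram_filter(token, min_gram, max_gram):
    """Yield (gram, start_offset, end_offset), shortest grams first."""
    text = token.lower()  # the "lowercase" filter runs before "my_ngram"
    for size in range(min_gram, min(max_gram, len(text)) + 1):
        for pos in range(len(text) - size + 1):
            yield text[pos:pos + size], pos, pos + size


# The keyword tokenizer turns the whole field value into a single token.
grams = list(ngram_filter("Samsung Galaxy Mini", 1, 20))

print(len(grams))   # 190 grams for one 19-character value
print(grams[18])    # ('i', 18, 19)  last 1-gram, at the end of the text
print(grams[19])    # ('sa', 0, 2)   first 2-gram, offsets jump back to 0
```

The point to notice is that under this model the offsets are not monotonically increasing: after the last 1-gram the stream starts again at offset 0 with the 2-grams.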

Index some data.

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy i7500"
}'

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy 5 Europa"
}'

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy Mini"
}'

Now we can run queries.

1) Search for 'i' to see that a one-character search works with highlighting

curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "i"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code":{},
      "code.ngram":{}
    }
  }
}'

This yields two search hits:

# 1
...
"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ],
  "code" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ]
}
# 2
...
"fields" : {
  "code" : "Samsung Galaxy i7500"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy <em>i</em>7500" ],
  "code" : [ "Samsung Galaxy <em>i</em>7500" ]
}

This time both the code and code.ngram fields are highlighted correctly. But things change quickly once a longer query is used:

2) Search for "y m"

curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "y m"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code":{},
      "code.ngram":{}
    }
  }
}'

This yields:

"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galax<em>y M</em>ini" ],
  "code" : [ "Samsung Galaxy Min<em>y M</em>i" ]
}

The code field, which has no term vectors, is not highlighted correctly (similar to your case), while code.ngram is.
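One plausible explanation is that the plain highlighter re-analyzes the stored text and expects token offsets to keep increasing, while the ngram filter hands it grams whose offsets jump back to 0 each time the gram length grows. The sketch below is a simplified model of that interaction, not Lucene's actual code. It assumes (a) grams are emitted shortest first with offsets into the original text, and (b) the highlighter merges a token into the current group whenever the token starts before the group's end. All names are made up for illustration.

```python
def ngram_filter(token, min_gram, max_gram):
    """Yield (gram, start, end), shortest grams first (assumed emission order)."""
    text = token.lower()
    for size in range(min_gram, min(max_gram, len(text)) + 1):
        for pos in range(len(text) - size + 1):
            yield text[pos:pos + size], pos, pos + size


def highlight(text, query_terms, min_gram, max_gram=20, pre="<em>", post="</em>"):
    """Streaming highlighter model that assumes offsets never go backwards."""
    out, written = [], 0   # `written` = how much of `text` has been emitted
    group = None           # tokens that overlap are collected into one group

    def flush(g):
        nonlocal written
        if g["m_start"] > written:
            out.append(text[written:g["m_start"]])
        piece = text[g["m_start"]:g["m_end"]]
        out.append(pre + piece + post if g["hit"] else piece)
        written = max(written, g["m_end"])

    for term, start, end in ngram_filter(text, min_gram, max_gram):
        hit = term in query_terms
        if group is not None and start >= group["end"]:
            flush(group)   # token lies after the group: group is complete
            group = None
        if group is None:
            group = {"end": end, "m_start": start, "m_end": end, "hit": hit}
            continue
        group["end"] = max(group["end"], end)
        if hit and group["hit"]:
            group["m_start"] = min(group["m_start"], start)
            group["m_end"] = max(group["m_end"], end)
        elif hit:
            group["m_start"], group["m_end"], group["hit"] = start, end, True
    if group is not None:
        flush(group)
    out.append(text[written:])
    return "".join(out)


# Term query "y m" against the field without term vectors.
print(highlight("Samsung Galaxy Mini", {"y m"}, 1))
# Samsung Galaxy Min<em>y M</em>i

# query_string "xyz" analyzed with min_gram 1, as in the question.
terms = {"x", "y", "z", "xy", "yz", "xyz"}
print(highlight("xyz post", terms, 1, pre="<strong>", post="</strong>"))
# <strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
```

Under these assumptions the model gives the same strings as the responses shown in this thread. With min_gram set to 2, the 2-grams overlap each other, so everything ends up in one group and the result for "xyz post" comes out as <strong>xyz</strong> post.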

One important thing is to use a term query instead of query_string.
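The reason this matters: query_string runs the query text through the analyzer, and here the default analyzer is the ngram one, so a single word turns into many search terms. A term query is not analyzed and looks up exactly the string you pass in (which therefore has to be lowercase to match the lowercased grams). A rough sketch of the difference, assuming a one-word query with no special query syntax; the function name analyze is made up for illustration:

```python
def analyze(text, min_gram, max_gram=20):
    """Model of the default analyzer: keyword tokenizer + lowercase + ngram."""
    token = text.lower()
    return [
        token[pos:pos + size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
        for pos in range(len(token) - size + 1)
    ]


# query_string: the query text is analyzed, so every gram becomes a search term.
print(analyze("xyz", 1))  # ['x', 'y', 'z', 'xy', 'yz', 'xyz']
print(analyze("xyz", 2))  # ['xy', 'yz', 'xyz']

# term query: no analysis, a single term is looked up as-is.
term_query_terms = ["xyz"]
print(term_query_terms)   # ['xyz']
```

With min_gram set to 1 the single characters x, y and z are search terms in their own right, which fits the separately highlighted letters in the question.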
