elasticsearch - How to perform an exact match query on an analyzed field in Elasticsearch?

Question

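This is probably a very common question, but the answers I have found so far are not satisfying.

Problem: I have an ES index made up of nearly 100 fields. Most of the fields are of type string and set to analyzed. However, a query can be either partial (match) or exact (more like term). So if my index contains a string field with the value super duper cool pizza, a partial query such as duper super should match the document, but an exact query such as cool pizza should not. On the other hand, Super Duper COOL PIzza should still match the document.

So far the partial matching part is easy: I use the AND operator in a match query. But I cannot get the other type to work.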
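To make the partial case concrete, here is a minimal Python sketch of what a match query with the AND operator does on an analyzed field. It is only a simulation: analyze is a hypothetical stand-in that lowercases and splits on whitespace, which approximates the standard analyzer for this sample text but is not Elasticsearch code.

```python
def analyze(text):
    """Hypothetical stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def partial_match(field_value, query_text):
    """Simulate match with operator AND: every query token must appear in the field."""
    return set(analyze(query_text)) <= set(analyze(field_value))


value = "super duper cool pizza"
print(partial_match(value, "duper super"))  # True: both tokens present, order ignored
print(partial_match(value, "cool pizza"))   # True: fine for partial, wrong for exact
print(partial_match(value, "cool pasta"))   # False: "pasta" is missing
```

The second call shows why the same query cannot double as the exact query: cool pizza passes the AND test even though it is not the whole value.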

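I looked at other posts related to this problem, and this one contains the closest solution: Elasticsearch exact match on analyzed field

Of the three solutions there, the first feels very complicated because I have a lot of fields and I do not use the REST API; I build queries dynamically with QueryBuilders and NativeSearchQueryBuilder from the Java API. It also produces many possible patterns, which I think would cause performance problems.

The second is a much simpler solution, but again I would have to maintain more (almost) redundant data, and I do not think a term query will ever solve my problem.

The last one has a problem, I think: it does not prevent super duper from matching super duper cool pizza, which is not the result I want.

So is there any other way I can achieve this? I can post some sample mappings if the question needs further clarification. I also keep the source (just in case). Please feel free to suggest any improvements.

Thanks in advance.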

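[UPDATE]

In the end I used multi_field and kept a raw field for exact queries. When I insert, I apply some custom modifications to the data, and when I search, I apply the same modification routine to the input text. This part is not handled by Elasticsearch. If you want Elasticsearch to do it, you also have to design appropriate analyzers.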
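As a rough illustration of applying one routine on both sides, here is a Python sketch. The normalize function and its rules (trim, collapse whitespace, uppercase) are assumptions made for this example, not the routine actually used; build_exact_query is likewise a hypothetical helper that only builds a request body and does not call Elasticsearch.

```python
import re


def normalize(text):
    """Hypothetical normalization: trim, collapse runs of whitespace, uppercase."""
    return re.sub(r"\s+", " ", text.strip()).upper()


def build_exact_query(field, user_input):
    """Build a term query body against the raw sub-field from the normalized input."""
    return {"query": {"term": {field + ".raw": normalize(user_input)}}}


# Value as it would be stored for exact matching after the same routine.
indexed_value = normalize("super duper cool pizza")

query = build_exact_query("text_field", "Super  Duper COOL PIzza ")
print(query)
print(query["query"]["term"]["text_field.raw"] == indexed_value)
```

Because both sides pass through the same function, differences in case or spacing in the user input disappear before the term comparison, while a shorter input such as cool pizza still produces a different term.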

Index settings and mapping requests:

PUT test_index

POST test_index/_close

PUT test_index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard_uppercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  }
}

PUT test_index/doc/_mapping
{
  "doc": {
     "properties": {
        "text_field": {
           "type": "string",
           "fields": {
              "raw": {
                 "type": "string",
                 "analyzer": "standard_uppercase"
              }
           }
        }
     }
  }
}

POST test_index/_open

Insert some sample data:

POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

Exact query:

GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "term": {
             "text_field.raw": "PIZZA"
            }
          }
        }
      }
    }
  }
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.4054651,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1.4054651,
            "_source": {
               "text_field": "pizza"
            }
         }
      ]
   }
}

Partial query:

GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "match": {
              "text_field": {
                "query": "pizza",
                "operator": "AND",
                "type": "boolean"
              }
            }
          }
        }
      }
    }
  }
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1,
            "_source": {
               "text_field": "pizza"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

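PS: These are generated queries, which is why there are some redundant blocks; many other fields get concatenated into the query.

Sadly, now I need to rewrite the whole mapping again :(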

score 5 · Accepted Answer

I think this will do what you want (or at least come as close as possible), using the keyword tokenizer and the lowercase token filter:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "lowercase_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": ["lowercase_token_filter"]
            }
         },
         "filter": {
            "lowercase_token_filter": {
               "type": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "lowercase": {
                     "type": "string",
                     "analyzer": "lowercase_analyzer"
                  }
               }
            }
         }
      }
   }
}

I added a few documents for testing:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

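Notice that the outer text_field is analyzed by the standard analyzer, then there is a raw sub-field that is not_analyzed (you may not want this one, I only added it for comparison), and another sub-field, lowercase, that creates tokens exactly the same as the input text except that they have been lowercased (but not split on whitespace). So this match query returns what you expected: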

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field.lowercase": "Super Duper COOL PIzza"
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

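Remember that the match query also runs the search phrase through the field's analyzer, so in this case searching for "super duper cool pizza" has exactly the same effect as searching for "Super Duper COOL PIzza" (you can still use a term query if you want a strictly exact match).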
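The difference between match and term on the lowercase sub-field can be sketched in Python. This is a simulation under simplifying assumptions, not Elasticsearch code: lowercase_analyzer mimics the keyword tokenizer plus lowercase filter, and match_query and term_query are hypothetical helpers that only compare tokens.

```python
docs = {1: "super duper cool pizza", 2: "some other text", 3: "pizza"}


def lowercase_analyzer(text):
    """Keyword tokenizer plus lowercase filter: the whole input is one lowercased token."""
    return [text.lower()]


def match_query(docs, text, analyzer):
    """Simplified match: analyze the query text, return ids of docs sharing a token."""
    query_tokens = set(analyzer(text))
    return sorted(i for i, v in docs.items() if query_tokens & set(analyzer(v)))


def term_query(docs, term, analyzer):
    """Simplified term: the query term is compared as-is, without analysis."""
    return sorted(i for i, v in docs.items() if term in analyzer(v))


print(match_query(docs, "Super Duper COOL PIzza", lowercase_analyzer))  # [1]
print(term_query(docs, "Super Duper COOL PIzza", lowercase_analyzer))   # []
print(term_query(docs, "super duper cool pizza", lowercase_analyzer))   # [1]
print(match_query(docs, "cool pizza", lowercase_analyzer))              # []
```

The last call is the case from the question: because the field holds a single token for the whole value, cool pizza no longer matches document 1.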

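It is useful to look at the terms the three documents generate in each field, since those are what your search queries run against (in this case raw and lowercase have the same tokens, but only because all the inputs were already lowercase):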

POST /test_index/_search
{
   "size": 0,
   "aggs": {
      "text_field_standard": {
         "terms": {
            "field": "text_field"
         }
      },
      "text_field_raw": {
         "terms": {
            "field": "text_field.raw"
         }
      },
      "text_field_lowercase": {
         "terms": {
            "field": "text_field.lowercase"
         }
      }
   }
}
...
{
   "took": 26,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "text_field_raw": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 1
            },
            {
               "key": "some other text",
               "doc_count": 1
            },
            {
               "key": "super duper cool pizza",
               "doc_count": 1
            }
         ]
      },
      "text_field_lowercase": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 1
            },
            {
               "key": "some other text",
               "doc_count": 1
            },
            {
               "key": "super duper cool pizza",
               "doc_count": 1
            }
         ]
      },
      "text_field_standard": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 2
            },
            {
               "key": "cool",
               "doc_count": 1
            },
            {
               "key": "duper",
               "doc_count": 1
            },
            {
               "key": "other",
               "doc_count": 1
            },
            {
               "key": "some",
               "doc_count": 1
            },
            {
               "key": "super",
               "doc_count": 1
            },
            {
               "key": "text",
               "doc_count": 1
            }
         ]
      }
   }
}

Here is the code I used to test it:

http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1

If you also want to do partial word matching, I suggest you take a look at ngrams. I wrote an introduction for Qbox here:

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
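For readers unfamiliar with the idea, the following Python sketch shows what character n-grams of a single token look like. The function name and the min_gram and max_gram defaults are arbitrary choices for illustration; they are not Elasticsearch's defaults, and the output order is not meant to match what an Elasticsearch ngram tokenizer emits.

```python
def ngrams(token, min_gram=2, max_gram=3):
    """Return all character n-grams of token with lengths in [min_gram, max_gram]."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]


grams = ngrams("pizza")
print(grams)           # ['pi', 'iz', 'zz', 'za', 'piz', 'izz', 'zza']
print("izz" in grams)  # True: a fragment of the word can now be looked up
```

Indexing these fragments as terms is what lets a query for part of a word find the document, at the cost of a larger index.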
