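I have a bunch of company data in an ES database. I am looking for a count of how many documents each company appears in, but I am having trouble with the aggregation query. I would like to exclude terms such as "corporation" or "company". So far I have only been able to do this successfully for one term at a time, as in the code below.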

{
    "aggs" : {
        "companies" : {
            "terms" : {
                "field" : "Companies.name",
                "exclude" : "corporation"
            }
        }
    }
}

Which returns

"aggregations": {
    "assignee": {
         "buckets": [
            {
               "key": "inc",
               "doc_count": 375
            },
            {
               "key": "company",
               "doc_count": 252
            }
         ]
     }
}

Ideally, I would like to be able to do something like this

{
    "aggs" : {
        "companies" : {
            "terms" : {
                "field" : "Companies.name",
                "exclude" : ["corporation", "inc.", "inc", "co", "company", "the", "industries", "incorporated", "international"]
            }
        }
    }
}

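But I have not been able to find a way to do this that does not throw an error.

I looked at the "terms" section of the aggregations chapter in the ES documentation and could only find an example with a single exclude. I would like to know whether it is possible to exclude multiple terms and, if so, what the correct syntax is.

Note: I know I could set the field to "not_analyzed" and get buckets for the full company name instead of the split name. However, I am hesitant to do that because analysis makes the buckets more tolerant of name variations (e.g., Microsoft Corp and Microsoft Corporation).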

2 Answers

The exclude parameter is a regular expression, so you could use a regular expression that exhaustively lists all choices:

"exclude" :
    "corporation|inc\\.|inc|co|company|the|industries|incorporated|international"

If you build this pattern programmatically, it is important to escape the values (e.g., .). If it is written by hand, you could simplify it by grouping alternatives (e.g., inc\\.? covers inc\\.|inc, or the more complicated co(mpany|rporation)?). If this is going to run a lot, it is probably worth testing how the added complexity affects performance.
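
As a rough illustration of the escaping point, here is a minimal Python sketch that builds the exclude pattern from a list of literal terms. The helper names (escape_term, build_exclude_pattern) are hypothetical, and the set of reserved characters is an assumption based on Lucene-style regular expression syntax, so check it against the Elasticsearch version you run. The last few lines use Python's re module only as a local stand-in to sanity-check the grouped variant; it is not the engine Elasticsearch uses.

```python
import json
import re

# Characters treated as reserved by Lucene-style regular expressions
# (assumption: this list matches the Elasticsearch version in use).
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_term(term):
    """Backslash-escape every reserved character in a literal term."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in term)


def build_exclude_pattern(terms):
    """Join the escaped literal terms into a single alternation."""
    return "|".join(escape_term(t) for t in terms)


terms = ["corporation", "inc.", "inc", "co", "company"]
query = {
    "aggs": {
        "companies": {
            "terms": {
                "field": "Companies.name",
                "exclude": build_exclude_pattern(terms),
            }
        }
    }
}

# json.dumps doubles the backslash, so the request body contains "inc\\.".
print(json.dumps(query, indent=4))

# Local check of the hand-grouped variant. Python's re agrees with the
# Lucene syntax for patterns this simple; fullmatch mimics whole-term matching.
grouped = re.compile(r"inc\.?|co(mpany|rporation)?")
for word in ["inc", "inc.", "co", "company", "corporation"]:
    assert grouped.fullmatch(word)
assert grouped.fullmatch("microsoft") is None
```

Escaping each term before joining keeps a value such as "inc." from being read as "inc" followed by any character.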

There are also optional flags that can be applied, which are the options that exist in Java Pattern. The one that might come in handy is CASE_INSENSITIVE.

"exclude" : {
    "pattern" : "...expression as before...",
    "flags" : "CASE_INSENSITIVE"
}
Answered 2014-04-02T04:42:01.687

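This is an old question, but here is an updated answer: exclude now supports an array of terms that are matched exactly.

So the array syntax in the OP is now valid and works as expected (in addition to the regular expression answer, which is still valid).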

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_exact_values
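
For reference, a minimal Python sketch that assembles the request body using the array form. It assumes an Elasticsearch version that accepts an array for exclude, as described above, and it only builds and prints the JSON; sending it to a cluster is left out. The variable names are illustrative.

```python
import json

# Terms to drop from the buckets; with the array form each entry is
# matched exactly, so no regular expression escaping is needed.
excluded_terms = [
    "corporation", "inc.", "inc", "co", "company",
    "the", "industries", "incorporated", "international",
]

query = {
    "aggs": {
        "companies": {
            "terms": {
                "field": "Companies.name",
                "exclude": excluded_terms,
            }
        }
    }
}

# Serializing with json.dumps guarantees valid JSON (no trailing commas).
print(json.dumps(query, indent=4))
```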

Answered 2017-10-06T14:40:42.430