We are having some difficulty figuring out how best to manage our tokenized and untokenized fields for both searching and sorting. Our goals are pretty straightforward:

  1. Support partial-word searches.
  2. Support sorting on all fields.
  3. Keep the mapping dynamic, since customers add new fields at runtime.

We're able to accomplish this using a dynamic template. We store strings three ways: with the standard analyzer, with a custom ngram tokenizer, and with a keyword tokenizer that keeps the whole value as a single lowercased token. The mapping:

curl -XPUT 'http://testServer:9200/test/' -d '{
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_ngram_analyzer": {
                        "tokenizer": "my_ngram_tokenizer",
                        "filter": [
                            "lowercase"
                        ],
                        "type" : "custom"
                    },
                    "default_search": {
                        "tokenizer" : "keyword",
                        "filter" : [
                            "lowercase"
                        ]
                    }
                },
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "nGram",
                        "min_gram": "3",
                        "max_gram": "100",
                        "token_chars": []
                    }
                }
            }
        },
        "mappings": {
            "TestObject": {
                "dynamic_templates": [
                    {
                        "metadata_template": {
                            "match_mapping_type": "string",
                            "path_match": "*",
                            "mapping": {
                                "type": "multi_field",
                                "fields": {
                                    "ngram": {
                                        "type": "{dynamic_type}",
                                        "index": "analyzed",
                                        "index_analyzer": "my_ngram_analyzer",
                                        "search_analyzer" : "default_search"
                                    },
                                    "{name}": {
                                        "type": "{dynamic_type}",
                                        "index": "analyzed",
                                        "index_analyzer" : "standard",
                                        "search_analyzer" : "default_search"
                                    },
                                    "sortable": {
                                        "type": "{dynamic_type}",
                                        "index": "analyzed",
                                        "analyzer" : "default_search"
                                    }
                                }
                            }
                        }
                    }
                ]
            }
        }
    }'

We're really only keeping the untokenized field around for sorting and exact matches (we even call it 'sortable'). This configuration makes partial-word searches easy: if the query is a "contains" query, we append ".ngram" to the query target. The problem we're having is deciding when to use the ".sortable" suffix. If we receive a request to sort on dateUpdated, for example, we don't want to use .sortable, since that field is a date. If the request is to sort on 'name', we do want to use it, since that field is a string, and we don't want to use it if we are trying to sort on 'price'.
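
As a rough sketch of the suffix convention, a "contains" search on a string field, sorted by that same field, could build its request body like this. The helper name `contains_query` is hypothetical and the `match` query is only one reasonable choice; the host, index, and type names are the ones from the mapping above.

```shell
# Hypothetical helper: print a search body for a "contains" query on a string field.
# $1 = field name, $2 = search text (assumed to need no JSON escaping).
contains_query() {
  # Search the ".ngram" sub-field, sort on the untokenized ".sortable" sub-field.
  printf '{"query":{"match":{"%s.ngram":"%s"}},"sort":[{"%s.sortable":{"order":"asc"}}]}\n' "$1" "$2" "$1"
}

contains_query name ell

# To send it to the index defined above (requires a running cluster):
# curl -XGET 'http://testServer:9200/test/TestObject/_search' -d "$(contains_query name ell)"
```

This only works for string fields, because only strings go through the dynamic template and get the two sub-fields.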

The logic to check the type of a field before sorting seems a little kludgy (we check in our model rather than checking the type in Elasticsearch). It would be nice to ALWAYS have a '.sortable' field around, but we can't run non-string types through the dynamic template above: booleans and numbers can't be run through an ngram tokenizer.
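
Roughly, the per-type check amounts to something like the following. This is a simplified illustration, not our real model code: the function name `sort_target` and the type strings are made up for the example.

```shell
# Hypothetical sketch of the type check done before sorting.
# $1 = field name, $2 = field type as reported by the application's model.
sort_target() {
  case "$2" in
    string) printf '%s.sortable\n' "$1" ;;  # strings sort on the untokenized sub-field
    *)      printf '%s\n' "$1" ;;           # dates, numbers, booleans sort on the field itself
  esac
}

sort_target name string
sort_target dateUpdated date
sort_target price double
```

Every sort request has to pay for that lookup, which is the part we would like to get rid of.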

Does anyone have a suggestion for how we can always have a ".sortable" field for sorting that would never be tokenized, regardless of the type? Or maybe there is a better solution to this kind of problem that we're not seeing? Thanks in advance!

1 Answer

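This really came down to us wanting a "sortable" field on every mapped field (we renamed it "unanalyzed", since it has other uses as well). The real trick is to create one dynamic template that applies to every type other than string, without adding a new dynamic template for each type. To do that, you need to set match_pattern to regex: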

            {
                "other_types": {
                    "match_mapping_type": "date|boolean|double|long|integer",
                    "match_pattern": "regex",
                    "path_match": ".*",
                    "mapping": {
                        "type": "multi_field",
                        "fields": {
                            "{name}": {
                                "type": "{dynamic_type}",
                                "index": "not_analyzed"
                            },
                            "unanalyzed": {
                                "type": "{dynamic_type}",
                                "index": "not_analyzed"
                            }
                        }
                    }
                }
            } 

Note that you also need to make a small change to "path_match": you have to use a real regular expression (rather than '*', which is an ES 'simple' expression).

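One downside of doing this is that we increase the size of the index, since we store all of these types twice. For our purposes, though, our indexes (we have a lot of them) have enough room to grow that it is worth it to avoid a type lookup on every field before sorting or running an exact-match query (we just always use ${fieldName}.unanalyzed).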
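
As a minimal sketch of what that buys you, the sort target can now be built the same way for every field. The helper name `sort_body` is hypothetical, and `match_all` is just a placeholder query; the field names and host are the ones from the question.

```shell
# Hypothetical helper: print a search body sorted on a field's ".unanalyzed" sub-field.
# $1 = field name, $2 = sort order (asc or desc). No type lookup is involved.
sort_body() {
  printf '{"query":{"match_all":{}},"sort":[{"%s.unanalyzed":{"order":"%s"}}]}\n' "$1" "$2"
}

sort_body dateUpdated desc  # a date field
sort_body name asc          # a string field

# To send it to the index from the question (requires a running cluster):
# curl -XGET 'http://testServer:9200/test/TestObject/_search' -d "$(sort_body price asc)"
```

This assumes the string template's "sortable" sub-field has also been renamed to "unanalyzed", so the same suffix exists on every type.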

Answered 2013-11-05T16:31:49.137