elasticsearch - Custom analyzer, use case: zip codes [ElasticSearch]

Question

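Let there be an index/type named customers/customer. Every document in this collection has a zip code as a property. Basically, a zip code can be:

String-String (e.g. 8907-1009)
String String (e.g. 211-20)
String (e.g. 30200)

I would like to set up my index analyzer so that a search finds as many matching documents as possible. Currently, I work like this: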

PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        }
        some string properties ...
      }
    }
  }
}

When I search for documents, I use this request:

GET /customers/customer/_search
{
  "query":{
    "prefix":{
      "zip-code":"211-20"
     }
   }
}

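This works well if you want a strict search. But if, for example, the zip code is "200 30", then searching with "200-30" will not return any results. I would like to configure my index analyzer so that this problem does not occur. Can anyone help me? Thanks.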
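To illustrate the problem described above, here is a minimal sketch in Python. It assumes that a not_analyzed field stores the whole value as one term and that a prefix query compares the search string against the start of that term. The helper name `prefix_hits` is made up for this illustration and is not part of Elasticsearch.

```python
def prefix_hits(query, stored_values):
    # With not_analyzed, each value is kept as a single, unmodified term,
    # so a prefix query only matches values that literally start with the query.
    return [value for value in stored_values if value.startswith(query)]

stored = ["200 30", "211-20", "30200"]

print(prefix_hits("211-20", stored))  # strict search finds the exact spelling
print(prefix_hits("200-30", stored))  # the dash variant of "200 30" finds nothing
```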

PS: If you need more information, let me know ;)

score 2 · Accepted Answer

As soon as you want to find variations, you don't want to use not_analyzed.

Let's try this with a different mapping:

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

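We're using the standard tokenizer; strings will be broken up into tokens at whitespace and punctuation marks (including dashes). You can see the actual tokens if you run the following query: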

POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": ["8907-1009", "211-20", "30200"]
}
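If you don't have a cluster at hand, the splitting can be approximated in Python. This is only a rough stand-in for the standard tokenizer that happens to behave the same way on simple zip codes like these; the function name `approx_standard_tokens` is an assumption for the sketch, and the real _analyze response also reports offsets and positions for each token.

```python
import re

def approx_standard_tokens(text):
    # Keep runs of letters and digits; whitespace and dashes act as separators.
    # No lowercasing is applied, mirroring the empty filter list in the analyzer.
    return re.findall(r"[0-9A-Za-z]+", text)

for zip_code in ["8907-1009", "211-20", "30200", "200 30"]:
    print(zip_code, "->", approx_standard_tokens(zip_code))
```

Note that "200 30" and "200-30" end up with the same tokens, which is what makes the variants from the question findable.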

Add your examples:

POST zip/_doc
{
  "zip": "8907-1009"
}
POST zip/_doc
{
  "zip": "211-20"
}
POST zip/_doc
{
  "zip": "30200"
}

Now the query seems to work fine:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "211-20"
    }
  }
}

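This also works if you search for just "211". However, it might be too lenient, since a match on any single token is enough: the query will also find "20", "20-211", "211-10", ...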
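The leniency comes from the match query's default OR behavior across the query tokens. A small Python sketch makes this visible; the tokenizer is the same rough approximation of the standard tokenizer, and the extra documents ("20", "20-211", "211-10") are hypothetical ones taken from the sentence above, not from the indexed examples.

```python
import re

def tokens(text):
    # Rough stand-in for the standard tokenizer on simple zip codes.
    return re.findall(r"[0-9A-Za-z]+", text)

def match_or(query, field_value):
    # Default match behavior: one shared token is enough for a hit.
    return bool(set(tokens(query)) & set(tokens(field_value)))

docs = ["211-20", "20", "20-211", "211-10", "30200"]
hits = [d for d in docs if match_or("211-20", d)]
print(hits)
```

Only "30200" is left out, because it shares no token with the query.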

What you probably want is a phrase search, where all the tokens in your query need to be in the field and in the right order:

GET zip/_search
{
  "query": {
    "match_phrase": {
      "zip": "211"
    }
  }
}
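As a sketch of what the phrase requirement means, the following Python models a phrase match as "the query tokens appear in the field as one consecutive run". The helper names and the extra test documents are assumptions for illustration; real phrase queries work on token positions in the index, which this simplification only imitates.

```python
import re

def tokens(text):
    # Rough stand-in for the standard tokenizer on simple zip codes.
    return re.findall(r"[0-9A-Za-z]+", text)

def phrase_matches(query, field_value):
    # True when the query tokens appear in the field as one consecutive run.
    q, f = tokens(query), tokens(field_value)
    if not q:
        return False
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

docs = ["211-20", "20-211", "211-10", "20"]
print([d for d in docs if phrase_matches("211-20", d)])
print([d for d in docs if phrase_matches("211", d)])
```

In this model a one-token phrase such as "211" still matches "20-211", since order only matters once the query has more than one token.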

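Addition:

If the zip codes have a hierarchical meaning (if you have "211-20", you want to find it when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.

So change the mapping to: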

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}

Using the same 3 documents from above, you can now use the match query:

GET zip/_search
{
  "query": {
    "match": {
      "zip": "1009"
    }
  }
}

"1009" won't find anything, but "8907" or "8907-1009" will.
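The reason is the set of tokens the path_hierarchy tokenizer emits: every prefix of the value up to each delimiter, never a trailing part on its own. Here is a minimal Python sketch of that behavior with "-" as the delimiter. The function names are made up for the illustration, and the sketch assumes the query string is analyzed with the same analyzer as the field, as in the mapping above.

```python
def path_hierarchy_tokens(text, delimiter="-"):
    # "8907-1009" -> ["8907", "8907-1009"]: one token per leading path.
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def match(query, field_value):
    # Query and field go through the same tokenizer; one shared token is a hit.
    return bool(set(path_hierarchy_tokens(query)) & set(path_hierarchy_tokens(field_value)))

docs = ["8907-1009", "211-20", "30200"]
for query in ["1009", "8907", "8907-1009"]:
    print(query, "->", [d for d in docs if match(query, d)])
```

Since "1009" never appears as a token by itself for "8907-1009", the first query comes back empty.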

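If you also want to find "1009", but with a lower score, you'll have to analyze the zip code with both of the variations I have shown (combining the 2 versions of the mapping):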

PUT zip
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "zip_hierarchical": {
          "tokenizer": "zip_tokenizer",
          "filter": [ ]
        },
        "zip_standard": {
          "tokenizer": "standard",
          "filter": [ ]
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_standard",
          "fields": {
            "hierarchical": {
              "type": "text",
              "analyzer": "zip_hierarchical"
            }
          }
        }
      }
    }
  }
}

Add a document with the reverse order to properly test it:

POST zip/_doc
{
  "zip": "1009-111"
}

Then search both fields, but boost the one using the hierarchical tokenizer by 3:

GET zip/_search
{
  "query": {
    "multi_match" : {
      "query" : "1009",
      "fields" : [ "zip", "zip.hierarchical^3" ] 
    }
  }
}

Then you can see that "1009-111" has a much higher score than "8907-1009".
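The score gap follows from which fields each document matches on. The Python sketch below does not compute Elasticsearch scores; it only lists the matching fields per document, using rough approximations of the two tokenizers. All function names are assumptions made for the illustration.

```python
import re

def standard_tokens(text):
    # Rough stand-in for the standard tokenizer on simple zip codes.
    return re.findall(r"[0-9A-Za-z]+", text)

def hierarchy_tokens(text, delimiter="-"):
    # One token per leading path, as the path_hierarchy tokenizer would emit.
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def matching_fields(query, zip_code):
    fields = []
    if set(standard_tokens(query)) & set(standard_tokens(zip_code)):
        fields.append("zip")
    if set(hierarchy_tokens(query)) & set(hierarchy_tokens(zip_code)):
        fields.append("zip.hierarchical")
    return fields

for doc in ["1009-111", "8907-1009", "211-20"]:
    print(doc, "->", matching_fields("1009", doc))
```

"1009-111" matches on both fields, including the boosted zip.hierarchical, while "8907-1009" matches on the plain zip field only.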
