elasticsearch - Searching for hyphenated text in Elasticsearch

Question

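I am storing a "Payment Reference Number" in Elasticsearch.

Its layout is, for example, 2-4-3-635844569819109531 or 2-4-2-635844533758635433.

I want to be able to search for documents by their payment reference number, either by:

searching with the "whole" reference number, e.g. entering 2-4-2-635844533758635433
searching with any "part" of the reference number taken from the "start", e.g. 2-4-2-63 (so only the second of the two examples is returned)

Note: I do not want to match "in the middle" or "at the end"; only from the beginning.

Anyway, the hyphens are confusing me.

Questions

1) I am not sure whether I should remove them in the mapping, with something like

"char_filter" : {
    "removeHyphen" : {
        "type" : "mapping",
        "mappings" : ["-=>"]
    }
},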

or not. I have never used a mapping char filter this way, so I am not sure whether it is necessary.
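To get a feel for what the removeHyphen char filter would do to a stored value, here is a small Python sketch. It only imitates the "-=>" mapping with str.replace; the name remove_hyphens is made up for illustration and none of this is Elasticsearch code.

```python
def remove_hyphens(text: str) -> str:
    """Imitate a mapping char filter configured with "-=>" (hyphen mapped to nothing)."""
    return text.replace("-", "")


if __name__ == "__main__":
    print(remove_hyphens("2-4-2-635844533758635433"))
    # Once the hyphens are gone, two different references can collapse
    # to the same string, so the separators carry information.
    print(remove_hyphens("2-4-21") == remove_hyphens("2-42-1"))
```

If references with differently sized segments can occur, stripping the hyphens could make distinct references indistinguishable, which is one reason to leave them in.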

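2) I think I need an "ngrams" filter, because I want to be able to search on part of the reference number, starting from the beginning. I imagine something like

"partial_word":{
    "filter":[
        "standard",
        "lowercase",
        "name_ngrams"
    ],
    "type":"custom",
    "tokenizer":"whitespace"
},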

and the filter

"name_ngrams":{
    "side":"front",
    "max_gram":50,
    "min_gram":2,
    "type":"edgeNGram"
},

I am not sure how to put it all together, but something like

"paymentReference":{
    "type":"string",
    "analyzer": "??",
    "fields":{
        "partial":{
            "search_analyzer":"???",
            "index_analyzer":"partial_word",
            "type":"string"
        }
    }
}

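Everything I have tried always seems to "break" in the second search case.

If I run curl 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433", the value is always split at the hyphens into separate tokens, so a search returns e.g. every document matching 2-, which is "a lot", rather than what I want when searching for 2-4-2-6.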
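The splitting described above can be imitated in a few lines of Python. This is only a rough stand-in for the default analyzer (it lowercases and keeps runs of letters and digits); the function name split_like_standard is hypothetical and the real tokenizer follows more elaborate rules.

```python
import re
from typing import List


def split_like_standard(text: str) -> List[str]:
    """Rough approximation: lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[0-9a-z]+", text.lower())


if __name__ == "__main__":
    # The indexed reference becomes four separate terms.
    print(split_like_standard("2-4-2-635844533758635433"))
    # The partial query becomes very common single-digit terms.
    print(split_like_standard("2-4-2-6"))
```

Under this approximation the query 2-4-2-6 turns into the terms 2, 4, 2 and 6, and the term 2 alone is shared by both example references, which would explain why so many documents come back.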

Can someone tell me how to map this field for the two types of search I am trying to achieve?

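Update - Answer

It is effectively what Val said below. I only changed the mapping slightly to be more specific about the analyzers, and I do not need the main string to be analyzed (it is set to not_analyzed), because I only query the partial field.

Mapping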

"paymentReference":{
    "type": "string",
    "index":"not_analyzed",
    "fields": {
        "partial": {
            "search_analyzer":"payment_ref",
            "index_analyzer":"payment_ref",
            "type":"string"
        }
    }
}

Analyzer

"payment_ref": {
    "type": "custom",
    "filter": [
        "lowercase",
        "name_ngrams"
    ],
    "tokenizer": "keyword"
}

Filter

"name_ngrams":{
    "side":"front",
    "max_gram":50,
    "min_gram":2,
    "type":"edgeNGram"
},

score 0 · Accepted Answer

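You don't need the mapping char filter for this.

You are on the right track with the Edge NGram token filter, since you only need to be able to search on prefixes. I would use the keyword tokenizer instead, to make sure the term is taken as a whole. So the way to set this up is: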

curl -XPUT localhost:9200/orders -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_word": {
          "type": "custom",
          "filter": [
            "lowercase",
            "ngram_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  },
  "mappings": {
    "order": {
      "properties": {
        "paymentReference": {
          "type": "string",
          "fields": {
            "partial": {
              "analyzer": "partial_word",
              "type": "string"
            }
          }
        }
      }
    }
  }
}'

You can then analyze what is going to be indexed into your paymentReference.partial field:

curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"

And you get exactly what you want, i.e. all the prefixes:

{
  "tokens" : [ {
    "token" : "2-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-6",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-63",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-635",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-6358",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-63584",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
  ...
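The token list above can be reproduced with a short Python sketch. It assumes the analyzer behaves as configured (keyword tokenizer, lowercase, then edge n-grams with min_gram 2 and max_gram 50); the function edge_ngrams is an illustrative stand-in, not part of Elasticsearch.

```python
from typing import List


def edge_ngrams(text: str, min_gram: int = 2, max_gram: int = 50) -> List[str]:
    """Prefixes of the whole lowercased input, shortest first."""
    term = text.lower()
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


if __name__ == "__main__":
    tokens = edge_ngrams("2-4-2-635844533758635433")
    print(tokens[:5])   # the first few prefixes, starting at length 2
    print(tokens[-1])   # the longest prefix is the full reference
```

Because the full reference is itself one of the prefixes (it is shorter than max_gram), the same field can serve both the "whole" and the "starts with" searches.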

Finally, you can search for any prefix:

curl -XGET 'localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3'
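One thing that may be worth checking, and this is my assumption rather than something stated in the answer: since the mapping sets analyzer to partial_word, the query text may be edge-n-grammed too, and if the resulting terms are OR'd, a query for 2-4-3 would also match on the short prefix 2-. The Python sketch below simulates both cases with a toy in-memory index; DOCS, INDEX and search are hypothetical names.

```python
from typing import Dict, List, Set

DOCS = ["2-4-3-635844569819109531", "2-4-2-635844533758635433"]


def edge_ngrams(text: str, min_gram: int = 2, max_gram: int = 50) -> List[str]:
    """Prefixes of the whole lowercased input, shortest first."""
    term = text.lower()
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


# What the toy index holds for each document: the set of its prefixes.
INDEX: Dict[str, Set[str]] = {doc: set(edge_ngrams(doc)) for doc in DOCS}


def search(query: str, ngram_query: bool) -> List[str]:
    """Return documents sharing at least one term with the query (OR semantics)."""
    query_terms = edge_ngrams(query) if ngram_query else [query.lower()]
    return [doc for doc in DOCS if any(t in INDEX[doc] for t in query_terms)]


if __name__ == "__main__":
    print(search("2-4-3", ngram_query=True))   # query n-grammed like the index
    print(search("2-4-3", ngram_query=False))  # query kept as one whole term
```

If the first case is what you observe, a search analyzer that does not produce n-grams (for example the keyword tokenizer plus lowercase) keeps the query as a single term, so only references that really start with it match.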

score 0 · Answer

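I am not sure whether a wildcard search fits your needs. I define a custom filter with preserve_original set to true and generate_number_parts set to false. Here is the sample code: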

PUT test1
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : [ "dont_split_on_numerics" ]
        }
      },
      "filter" : {
        "dont_split_on_numerics" : {
          "type" : "word_delimiter",
          "preserve_original": true,
          "generate_number_parts" : false
        }
      }
    }
  },
  "mappings": {
    "type_one": {
      "properties": {
        "title": { 
          "type": "text",
          "analyzer": "standard" 
        }
      }
    },
    "type_two": {
      "properties": {
        "raw": { 
          "type": "text",
          "analyzer": "myAnalyzer" 
        }
      }
    }
  }
}

POST test1/type_two/1
{
  "raw": "2-345-6789" 
}

GET test1/type_two/_search
{
  "query": {
    "wildcard": {
      "raw": "2-345-67*" 
    }
  }
}
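To illustrate why the wildcard query finds the document, here is a minimal Python sketch. It assumes the analyzer leaves 2-345-6789 in the index as a single token (which is what preserve_original is meant to ensure) and uses fnmatchcase as a stand-in for wildcard matching against that token; wildcard_match is a made-up helper.

```python
from fnmatch import fnmatchcase


def wildcard_match(token: str, pattern: str) -> bool:
    """Match a whole indexed token against a pattern where * means any characters."""
    return fnmatchcase(token, pattern)


if __name__ == "__main__":
    indexed_token = "2-345-6789"  # assumed to be kept intact by the analyzer
    print(wildcard_match(indexed_token, "2-345-67*"))  # prefix pattern
    print(wildcard_match(indexed_token, "345*"))       # does not start the token
```

The pattern is compared with the whole token from its first character, so a trailing * gives the "from the start only" behaviour asked for in the question.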
