We have indexed many documents whose titles may contain things like "lightbulb 220V", "box 23cm", or "Varta Super-charge battery 74Ah". However, our users tend to separate the number from the unit with a space when searching, so a query for "Varta 74 Ah" does not return the results they expect. This is a simplification of the problem, but the main point hopefully still holds. How can "Varta Super-charge Battery 74Ah" be analyzed so that (on top of the other tokens) the tokens `74`, `Ah`, and `74Ah` are created?
Thanks,
Michael
You need to create a custom analyzer that uses the Ngram tokenizer and then apply it to the `text` field you create.
Below is a sample mapping, documents, query, and response:
PUT my_split_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 3
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {                 <---- custom analyzer
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "my_analyzer",       <---- note how the custom analyzer is applied to this field
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
The feature you are looking for is called ngram, and it creates multiple tokens from a single token. The size of those tokens depends on the `min_gram` and `max_gram` settings mentioned above.
Note that I set `max_ngram_diff` to 3; this is because in 7.x the default value in ES is 1. Looking at your use case, I set it to 3. This value is nothing but `max_gram` minus `min_gram`.
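To make the ngram behavior concrete, here is a minimal Python sketch (not the real Elasticsearch implementation) of how a single token is expanded into 2- to 5-character grams:

```python
# Sketch of what the ngram tokenizer does with one token:
# emit every substring whose length lies between min_gram and max_gram.
def ngrams(token, min_gram=2, max_gram=5):
    out = []
    for i in range(len(token)):                  # each start position
        for n in range(min_gram, max_gram + 1):  # each gram length
            if i + n <= len(token):
                out.append(token[i:i + n])
    return out

print(ngrams("74Ah"))  # ['74', '74A', '74Ah', '4A', '4Ah', 'Ah']
```

Since `74`, `Ah`, and `74Ah` are all emitted, the space-separated query "Varta 74 Ah" can match the indexed "74Ah".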
POST my_split_index/_doc/1
{
  "product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
  "product": "lightbulb 220V"
}
POST my_split_index/_search
{
  "query": {
    "match": {
      "product": "74Ah"
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7029606,
    "hits" : [
      {
        "_index" : "my_split_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.7029606,
        "_source" : {
          "product" : "Varta 74 Ah"
        }
      }
    ]
  }
}
To see which tokens are actually generated, you can use the Analyze API:
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Varta 74 Ah"
}
You can see that when I run the API above, the following tokens are generated:
{
  "tokens" : [
    { "token" : "Va",    "start_offset" : 0, "end_offset" : 2,  "type" : "word", "position" : 0 },
    { "token" : "Var",   "start_offset" : 0, "end_offset" : 3,  "type" : "word", "position" : 1 },
    { "token" : "Vart",  "start_offset" : 0, "end_offset" : 4,  "type" : "word", "position" : 2 },
    { "token" : "Varta", "start_offset" : 0, "end_offset" : 5,  "type" : "word", "position" : 3 },
    { "token" : "ar",    "start_offset" : 1, "end_offset" : 3,  "type" : "word", "position" : 4 },
    { "token" : "art",   "start_offset" : 1, "end_offset" : 4,  "type" : "word", "position" : 5 },
    { "token" : "arta",  "start_offset" : 1, "end_offset" : 5,  "type" : "word", "position" : 6 },
    { "token" : "rt",    "start_offset" : 2, "end_offset" : 4,  "type" : "word", "position" : 7 },
    { "token" : "rta",   "start_offset" : 2, "end_offset" : 5,  "type" : "word", "position" : 8 },
    { "token" : "ta",    "start_offset" : 3, "end_offset" : 5,  "type" : "word", "position" : 9 },
    { "token" : "74",    "start_offset" : 6, "end_offset" : 8,  "type" : "word", "position" : 10 },
    { "token" : "Ah",    "start_offset" : 9, "end_offset" : 11, "type" : "word", "position" : 11 }
  ]
}
Note that the query I used in the search request above was `74Ah`, and it still returned the document. This is because ES applies the analyzer twice, at index time and at search time. By default, if you do not specify a `search_analyzer`, the analyzer you applied during indexing is also applied at query time.
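If you ever do want different analysis at query time, the mapping accepts a `search_analyzer` alongside `analyzer`. A hypothetical sketch (the index name `my_split_index_v2` is made up, and it assumes the same `my_analyzer` settings as above):

```
PUT my_split_index_v2
{
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "my_analyzer",        <---- used at index time
        "search_analyzer": "standard"     <---- used at query time
      }
    }
  }
}
```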
Hope this helps!
I think this will help you:
PUT index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_filter": {
          "type": "word_delimiter",
          "split_on_numerics": true
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["custom_filter"]
        }
      }
    }
  }
}
You can use the `split_on_numerics` property in your custom filter. This will give you the following response:
Request
POST /index_name/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Varta Super-charge battery 74Ah"
}
Response
{
  "tokens" : [
    { "token" : "Varta",   "start_offset" : 0,  "end_offset" : 5,  "type" : "word", "position" : 0 },
    { "token" : "Super",   "start_offset" : 6,  "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "charge",  "start_offset" : 12, "end_offset" : 18, "type" : "word", "position" : 2 },
    { "token" : "battery", "start_offset" : 19, "end_offset" : 26, "type" : "word", "position" : 3 },
    { "token" : "74",      "start_offset" : 27, "end_offset" : 29, "type" : "word", "position" : 4 },
    { "token" : "Ah",      "start_offset" : 29, "end_offset" : 31, "type" : "word", "position" : 5 }
  ]
}
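The effect of `split_on_numerics` can be approximated with a small Python sketch (a rough emulation, not the real Elasticsearch filter): non-alphanumeric characters act as delimiters, and every letter/digit boundary also becomes a split point.

```python
import re

# Rough emulation of the word_delimiter filter with split_on_numerics:
# split on non-alphanumeric characters, then separate runs of letters
# from runs of digits within each chunk.
def split_on_numerics(text):
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        tokens.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
    return tokens

print(split_on_numerics("Varta Super-charge battery 74Ah"))
# ['Varta', 'Super', 'charge', 'battery', '74', 'Ah']
```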
As you mentioned in your question, you can define the index mapping as below and check the tokens it generates. In addition, it does not create as many tokens, so your index will be smaller.
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "word_delimiter",
          "split_on_numerics": "true",
          "catenate_words": "true",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then check the tokens generated using the `_analyze` API:
{
  "text": "Varta Super-charge battery 74Ah",
  "analyzer" : "my_analyzer"
}
{
  "tokens": [
    { "token": "varta",        "start_offset": 0,  "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "super-charge", "start_offset": 6,  "end_offset": 18, "type": "word", "position": 1 },
    { "token": "super",        "start_offset": 6,  "end_offset": 11, "type": "word", "position": 1 },
    { "token": "supercharge",  "start_offset": 6,  "end_offset": 18, "type": "word", "position": 1 },
    { "token": "charge",       "start_offset": 12, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "battery",      "start_offset": 19, "end_offset": 26, "type": "word", "position": 3 },
    { "token": "74ah",         "start_offset": 27, "end_offset": 31, "type": "word", "position": 4 },
    { "token": "74",           "start_offset": 27, "end_offset": 29, "type": "word", "position": 4 },
    { "token": "ah",           "start_offset": 29, "end_offset": 31, "type": "word", "position": 5 }
  ]
}
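The combined effect of `preserve_original`, `split_on_numerics`, and `catenate_words` (plus the `lowercase` filter) can be sketched in Python. This is a rough emulation of the resulting token set, not the real Elasticsearch filter chain, and it does not reproduce exact token order or offsets:

```python
import re

def analyze(text):
    """Rough sketch of: whitespace tokenizer + word_delimiter
    (split_on_numerics, catenate_words, preserve_original) + lowercase."""
    tokens = []
    for word in text.split():                        # whitespace tokenizer
        parts = []
        for chunk in re.split(r"[^A-Za-z0-9]+", word):
            parts.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
        if len(parts) > 1:
            tokens.append(word)                      # preserve_original
        tokens.extend(parts)                         # the split sub-words
        letters = [p for p in parts if p.isalpha()]
        if len(letters) > 1:
            tokens.append("".join(letters))          # catenate_words
    return {t.lower() for t in tokens}               # lowercase filter

print(sorted(analyze("Varta Super-charge battery 74Ah")))
```

This yields the same nine tokens shown in the `_analyze` response above, including `74ah`, `74`, `ah`, and `supercharge`.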
Edit: at first glance the tokens generated by the two approaches may look the same, but on closer inspection they are quite different, and I made sure this one satisfies all the requirements given in the question. In particular, the tokens `74ah` and `supercharge`, which the question calls for, are also produced by my analyzer.