We have indexed many documents whose titles may contain things like "lightbulb 220V", "box 23cm", or "Varta Super-charge battery 74Ah". However, our users tend to separate the number from the unit with a space when searching, so a query for "Varta 74 Ah" does not return the results they expect. This is a simplification of the problem, but the main point hopefully still holds. How can "Varta Super-charge Battery 74Ah" be analyzed so that (on top of the other tokens) the tokens `74`, `Ah`, and `74Ah` are created?
Thanks,
Michael
You need to create a custom analyzer that uses the Ngram tokenizer and then apply it to the `text` field you create.
Below is a sample mapping, documents, query, and response:
PUT my_split_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 3
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {                 <---- custom analyzer
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "my_analyzer",       <---- note how the custom analyzer is applied to this field
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
The feature you are looking for is called ngram, and it creates multiple tokens from a single token. The size of those tokens depends on the `min_gram` and `max_gram` settings mentioned above.
Note that I set `max_ngram_diff` to 3; this is because in 7.x the default value in ES is 1. Looking at your use case, I set it to 3. This value is nothing but `max_gram` minus `min_gram`.
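To make the ngram behavior concrete, here is a minimal Python sketch (not the real Elasticsearch implementation) of how a single token is expanded into 2- to 5-character grams:

```python
# Sketch of what the ngram tokenizer does with one token:
# emit every substring whose length lies between min_gram and max_gram.
def ngrams(token, min_gram=2, max_gram=5):
    out = []
    for i in range(len(token)):                  # each start position
        for n in range(min_gram, max_gram + 1):  # each gram length
            if i + n <= len(token):
                out.append(token[i:i + n])
    return out

print(ngrams("74Ah"))  # ['74', '74A', '74Ah', '4A', '4Ah', 'Ah']
```

Since `74`, `Ah`, and `74Ah` are all emitted, the space-separated query "Varta 74 Ah" can match the indexed "74Ah".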
POST my_split_index/_doc/1
{
  "product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
  "product": "lightbulb 220V"
}
POST my_split_index/_search
{
  "query": {
    "match": {
      "product": "74Ah"
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7029606,
    "hits" : [
      {
        "_index" : "my_split_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.7029606,
        "_source" : {
          "product" : "Varta 74 Ah"
        }
      }
    ]
  }
}
To see which tokens are actually generated, you can use the Analyze API:
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Varta 74 Ah"
}
You can see that when I run the API above, the following tokens are generated:
{
  "tokens" : [
    { "token" : "Va",    "start_offset" : 0, "end_offset" : 2,  "type" : "word", "position" : 0 },
    { "token" : "Var",   "start_offset" : 0, "end_offset" : 3,  "type" : "word", "position" : 1 },
    { "token" : "Vart",  "start_offset" : 0, "end_offset" : 4,  "type" : "word", "position" : 2 },
    { "token" : "Varta", "start_offset" : 0, "end_offset" : 5,  "type" : "word", "position" : 3 },
    { "token" : "ar",    "start_offset" : 1, "end_offset" : 3,  "type" : "word", "position" : 4 },
    { "token" : "art",   "start_offset" : 1, "end_offset" : 4,  "type" : "word", "position" : 5 },
    { "token" : "arta",  "start_offset" : 1, "end_offset" : 5,  "type" : "word", "position" : 6 },
    { "token" : "rt",    "start_offset" : 2, "end_offset" : 4,  "type" : "word", "position" : 7 },
    { "token" : "rta",   "start_offset" : 2, "end_offset" : 5,  "type" : "word", "position" : 8 },
    { "token" : "ta",    "start_offset" : 3, "end_offset" : 5,  "type" : "word", "position" : 9 },
    { "token" : "74",    "start_offset" : 6, "end_offset" : 8,  "type" : "word", "position" : 10 },
    { "token" : "Ah",    "start_offset" : 9, "end_offset" : 11, "type" : "word", "position" : 11 }
  ]
}
Note that the query I used in the search request above was `74Ah`, and it still returned the document. This is because ES applies the analyzer twice, at index time and at search time. By default, if you do not specify a `search_analyzer`, the analyzer you applied during indexing is also applied at query time.
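If you ever do want different analysis at query time, the mapping accepts a `search_analyzer` alongside `analyzer`. A hypothetical sketch (the index name `my_split_index_v2` is made up, and it assumes the same `my_analyzer` settings as above):

```
PUT my_split_index_v2
{
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "my_analyzer",        <---- used at index time
        "search_analyzer": "standard"     <---- used at query time
      }
    }
  }
}
```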
Hope this helps!
I think this will help you:
PUT index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_filter": {
          "type": "word_delimiter",
          "split_on_numerics": true
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["custom_filter"]
        }
      }
    }
  }
}
You can use the `split_on_numerics` property in your custom filter. This will give you the following response:
Request
POST /index_name/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Varta Super-charge battery 74Ah"
}
Response
{
  "tokens" : [
    { "token" : "Varta",   "start_offset" : 0,  "end_offset" : 5,  "type" : "word", "position" : 0 },
    { "token" : "Super",   "start_offset" : 6,  "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "charge",  "start_offset" : 12, "end_offset" : 18, "type" : "word", "position" : 2 },
    { "token" : "battery", "start_offset" : 19, "end_offset" : 26, "type" : "word", "position" : 3 },
    { "token" : "74",      "start_offset" : 27, "end_offset" : 29, "type" : "word", "position" : 4 },
    { "token" : "Ah",      "start_offset" : 29, "end_offset" : 31, "type" : "word", "position" : 5 }
  ]
}
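The effect of `split_on_numerics` can be approximated with a small Python sketch (a rough emulation, not the real Elasticsearch filter): non-alphanumeric characters act as delimiters, and every letter/digit boundary also becomes a split point.

```python
import re

# Rough emulation of the word_delimiter filter with split_on_numerics:
# split on non-alphanumeric characters, then separate runs of letters
# from runs of digits within each chunk.
def split_on_numerics(text):
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        tokens.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
    return tokens

print(split_on_numerics("Varta Super-charge battery 74Ah"))
# ['Varta', 'Super', 'charge', 'battery', '74', 'Ah']
```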
As you mentioned in your question, you can define the index mapping as below and check the tokens it generates. In addition, it does not create as many tokens, so your index will be smaller.
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "word_delimiter",
          "split_on_numerics": "true",
          "catenate_words": "true",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "my_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then check the tokens generated using the `_analyze` API:
{
  "text": "Varta Super-charge battery 74Ah",
  "analyzer" : "my_analyzer"
}
{
  "tokens": [
    { "token": "varta",        "start_offset": 0,  "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "super-charge", "start_offset": 6,  "end_offset": 18, "type": "word", "position": 1 },
    { "token": "super",        "start_offset": 6,  "end_offset": 11, "type": "word", "position": 1 },
    { "token": "supercharge",  "start_offset": 6,  "end_offset": 18, "type": "word", "position": 1 },
    { "token": "charge",       "start_offset": 12, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "battery",      "start_offset": 19, "end_offset": 26, "type": "word", "position": 3 },
    { "token": "74ah",         "start_offset": 27, "end_offset": 31, "type": "word", "position": 4 },
    { "token": "74",           "start_offset": 27, "end_offset": 29, "type": "word", "position": 4 },
    { "token": "ah",           "start_offset": 29, "end_offset": 31, "type": "word", "position": 5 }
  ]
}
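The combined effect of `preserve_original`, `split_on_numerics`, and `catenate_words` (plus the `lowercase` filter) can be sketched in Python. This is a rough emulation of the resulting token set, not the real Elasticsearch filter chain, and it does not reproduce exact token order or offsets:

```python
import re

def analyze(text):
    """Rough sketch of: whitespace tokenizer + word_delimiter
    (split_on_numerics, catenate_words, preserve_original) + lowercase."""
    tokens = []
    for word in text.split():                        # whitespace tokenizer
        parts = []
        for chunk in re.split(r"[^A-Za-z0-9]+", word):
            parts.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
        if len(parts) > 1:
            tokens.append(word)                      # preserve_original
        tokens.extend(parts)                         # the split sub-words
        letters = [p for p in parts if p.isalpha()]
        if len(letters) > 1:
            tokens.append("".join(letters))          # catenate_words
    return {t.lower() for t in tokens}               # lowercase filter

print(sorted(analyze("Varta Super-charge battery 74Ah")))
```

This yields the same nine tokens shown in the `_analyze` response above, including `74ah`, `74`, `ah`, and `supercharge`.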
Edit: at first glance the tokens generated by the two approaches may look the same, but on closer inspection they are quite different, and I made sure this one satisfies all the requirements given in the question. In particular, the tokens `74ah` and `supercharge`, which the question calls for, are also produced by my analyzer.