I have titles like SimpleDoc000155/1 (the character count is not fixed, but they are always followed by 9 digits, a "/" and more digits), and I would like to know how to analyze them so that I get these results: 155 and SimpleDoc000155.
Elasticsearch version is 2.2.
My current settings are:
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "autocomplete",
"filter" : [ "code", "lowercase" ]
}
},
"filter": {
"code": {
"type": "pattern_capture",
"preserve_original" : 1,
"patterns": ["([1-9].+(?=\/))"]
}
},
"tokenizer" : {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 6,
"max_gram": 32,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
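For reference, I am checking the output with the _analyze API, with a call roughly like this (my_index stands in for my actual index name):

    curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_analyzer' -d 'SimpleDoc000155/1'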
The result I get is:
{
  "tokens": [
    { "token": "simple",          "start_offset": 0, "end_offset": 6,  "type": "word", "position": 0 },
    { "token": "simpled",         "start_offset": 0, "end_offset": 7,  "type": "word", "position": 1 },
    { "token": "simpledo",        "start_offset": 0, "end_offset": 8,  "type": "word", "position": 2 },
    { "token": "simpledoc",       "start_offset": 0, "end_offset": 9,  "type": "word", "position": 3 },
    { "token": "simpledoc0",      "start_offset": 0, "end_offset": 10, "type": "word", "position": 4 },
    { "token": "simpledoc00",     "start_offset": 0, "end_offset": 11, "type": "word", "position": 5 },
    { "token": "simpledoc000",    "start_offset": 0, "end_offset": 12, "type": "word", "position": 6 },
    { "token": "simpledoc0001",   "start_offset": 0, "end_offset": 13, "type": "word", "position": 7 },
    { "token": "simpledoc00015",  "start_offset": 0, "end_offset": 14, "type": "word", "position": 8 },
    { "token": "simpledoc000155", "start_offset": 0, "end_offset": 15, "type": "word", "position": 9 }
  ]
}
I'm a bit lost. I have tried a lot of things, but I cannot get 155 back; it looks like pattern_capture is not working properly.
Thanks in advance for your answers!
Update:
Changing the tokenizer from edge_ngram to ngram sort of works, but it produces a lot of unwanted tokens.
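My current guess at the underlying problem: the autocomplete tokenizer only keeps letter and digit characters, so the "/" is stripped before the tokens ever reach the code filter, which means the lookahead (?=\/) can never match, and the leading zeros keep [1-9].+ from matching just the number. What I would expect to work (only a sketch, not verified against 2.2, and the regexes are my own guesses) is to keep the whole title as a single token with the keyword tokenizer and let pattern_capture emit both the full code and the number without leading zeros:

    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "code", "lowercase" ]
        }
      },
      "filter": {
        "code": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([A-Za-z]+\\d+)(?=/)",
            "0*([1-9]\\d*)(?=/)"
          ]
        }
      }
    }

If that behaves the way I read the pattern_capture docs, "SimpleDoc000155/1" should come out as "simpledoc000155/1" (the preserved original), "simpledoc000155" and "155", but it drops the autocomplete behaviour, so the edge_ngram part would presumably have to come back as a token filter applied after the capture.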