
I have titles like SimpleDoc000155/1 (the number of characters varies, but they are always followed by 9 digits, a "/", and a number), and I'd like to know how to analyze these so that I get the tokens 155 and SimpleDoc000155.

I'm on Elasticsearch 2.2.

My current settings are:

"analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "autocomplete",
          "filter" : [ "code", "lowercase" ]
                }
            },
            "filter": {
                "code": {
                    "type": "pattern_capture",
          "preserve_original" : 1,
                    "patterns": ["([1-9].+(?=\/))"]
                }
            },
      "tokenizer" : {
      "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 6,
          "max_gram": 32,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
        }
    }

The result I get is:

{
    "tokens": [{
            "token": "simple",
            "start_offset": 0,
            "end_offset": 6,
            "type": "word",
            "position": 0
        },
        {
            "token": "simpled",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "simpledo",
            "start_offset": 0,
            "end_offset": 8,
            "type": "word",
            "position": 2
        },
        {
            "token": "simpledoc",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 3
        },
        {
            "token": "simpledoc0",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 4
        },
        {
            "token": "simpledoc00",
            "start_offset": 0,
            "end_offset": 11,
            "type": "word",
            "position": 5
        },
        {
            "token": "simpledoc000",
            "start_offset": 0,
            "end_offset": 12,
            "type": "word",
            "position": 6
        },
        {
            "token": "simpledoc0001",
            "start_offset": 0,
            "end_offset": 13,
            "type": "word",
            "position": 7
        },
        {
            "token": "simpledoc00015",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 8
        },
        {
            "token": "simpledoc000155",
            "start_offset": 0,
            "end_offset": 15,
            "type": "word",
            "position": 9
        }
    ]
}

I'm a bit lost. I've tried a lot of things, but I can't recover 155; it looks like pattern_capture isn't working properly.
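To check whether the regex itself is the problem, here is a minimal sketch in Python (standard `re`, not Lucene's regex engine, so only an approximation): the capture group does match the full title, but the edge_ngram tokenizer only emits letters and digits (`token_chars`), so the "/" is gone by the time the pattern_capture filter runs and the lookahead can never succeed.

```python
import re

# Same capture pattern as in the "code" filter above
PATTERN = re.compile(r"[1-9].+(?=/)")

# On the raw title the capture works as intended:
m = PATTERN.search("SimpleDoc000155/1")
print(m.group())  # 155

# But the edge_ngram tokens never contain "/", so the
# lookahead (?=/) can never match inside any of them:
print(PATTERN.search("SimpleDoc000155"))  # None
```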

Thanks in advance for any answers!

Update:

Changing the tokenizer from edge_ngram to ngram sort of works, but it produces a lot of unwanted tokens.
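The token explosion is easy to see with a rough sketch of what an ngram tokenizer does (assumed simplification, ignoring `token_chars` splitting): it emits every substring whose length lies between min_gram and max_gram, so a single 15-character token like simpledoc000155 already yields dozens of grams.

```python
def ngrams(token, min_gram=6, max_gram=32):
    """Emit every substring of token with length in [min_gram, max_gram]."""
    return [
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    ]

grams = ngrams("simpledoc000155")
print(len(grams))  # 55
```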
