
I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch?

For instance, for a document field containing "The red car drives." I would like to get the following information:

red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance

Thanks in advance!

1 Answer

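Assuming you already know about the Term Vectors API, you can apply a shingle token filter at index time so that the word n-grams are added to the token stream as terms in their own right.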

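Keep min_shingle_size at its default of 2 and set max_shingle_size to at least 3 (the default is 2). The shingle filter rejects a min_shingle_size below 2; single words are still emitted because output_unigrams defaults to true.

Since you left "the" out of the possible terms, you should also apply a stop-word filter before the shingle filter.

The analyzer settings would look something like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle"
          ]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "enable_position_increments": "false"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        }
      }
    }
  }
}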

You can test the analyzer with the _analyze API endpoint.
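
To get a feel for the terms this kind of analysis chain is meant to produce, here is a rough pure-Python sketch of the same idea: lowercase, drop stop words, then emit every word n-gram up to three words long and count it. This is only an approximation and not what Elasticsearch runs internally: the function name shingle_counts and the tiny stop list are made up for illustration, the real _english_ stop list is longer, and the sketch ignores token positions, so it never produces the filler tokens a real shingle filter can insert where a stop word was removed.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop list (assumption for this sketch);
# Elasticsearch's built-in _english_ list contains more words.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def shingle_counts(text, max_size=3):
    """Count word n-grams of 1..max_size words after lowercasing
    and removing stop words."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]
    counts = Counter()
    for size in range(1, max_size + 1):
        # Slide a window of `size` tokens across the token list.
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    return counts


if __name__ == "__main__":
    for term, freq in shingle_counts("The red car drives.").items():
        print(f"{term} - {freq}")
```

For the sentence from the question this yields the six terms listed there, each with a count of 1. Comparing that list with what _analyze returns for your index is a quick way to check that the filters are applied in the intended order.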

Answered 2014-12-10T02:12:21.857