
By default, Elasticsearch's english analyzer breaks at&t into the tokens at and t, and then removes at as a stop word.

POST _analyze
{
  "analyzer": "english", 
  "text": "A word AT&T Procter&Gamble"
}

The resulting tokens look like:

{
  "tokens" : [
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gambl",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

I want to be able to match at&t exactly, to search for procter&gamble exactly, and also to search for just procter, for example.

So I want to build an analyzer that produces the tokens t and at&t for the string at&t, and the tokens procter, gambl, and procter&gamble for the string procter&gamble.

Is there a way to create such an analyzer? Or should I create two index fields: one using the regular english analyzer, and another using English analysis except tokenization by &?


1 Answer


Mapping: you can tokenize on whitespace and use a word-delimiter filter to create the tokens for at&t:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  }
}
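To use this analyzer at search time it has to be attached to a field in the index mapping. A minimal sketch combining the settings above with a mapping; my_index and company are placeholder names:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "whitespace_with_acronymns"
      }
    }
  }
}
```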

Tokens:

{
  "analyzer": "whitespace_with_acronymns", 
  "text": "A word AT&T Procter&Gamble"
}

Result: at&t is tokenized as at, t, and att, so you can search by at, t, or at&t.

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "at",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "att",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "proctergamble",
      "start_offset" : 12,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "gamble",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "word",
      "position" : 5
    }
  ]
}
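Because catenate_all emits the joined token att at the same position, a plain match query for AT&T is analyzed the same way at query time and matches documents containing at&t. A sketch, assuming the analyzer has been attached to a text field (my_index and company are placeholder names):

```
GET /my_index/_search
{
  "query": {
    "match": {
      "company": "AT&T"
    }
  }
}
```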

If you want to remove the stop word "at", you can add a stop token filter:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns",
            "english_possessive_stemmer",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      }
    }
  }
}
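With the stop filter in place, re-running the same _analyze request (against the index that defines the custom analyzer; my_index is a placeholder name) lets you verify the change:

```
POST /my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns",
  "text": "A word AT&T Procter&Gamble"
}
```

The surviving tokens should be roughly word, att, t, procter, proctergambl, and gambl: a and at are removed by the stop filter, and the stemmer reduces gamble to gambl. The exact output may vary by Elasticsearch version.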
Answered 2020-05-10T01:08:20.337