
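I have a question about Elasticsearch language analyzers. I am working with Lithuanian, so I am using the lithuanian analyzer. The analyzer works fine and I get all the grammatical cases I need. For example, I index the Lithuanian city "Klaipėda":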

PUT /cities/city/1
{
  "name": "Klaipėda"
}

The problem is that I also need to get a result when I search for the city using only the Latin alphabet ("Klaipeda"), in all Lithuanian cases:

  1. Nominative: "Klaipeda"
  2. Genitive: "Klaipedos"
  3. ...
  4. Locative: "Klaipedoje"

"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.

My index:

PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type":     "string",
          "analyzer": "lithuanian",
            "fields": {
              "folded": {
              "type": "string",
              "analyzer": "md_folded_analyzer"
             }
           }
        }
      }
    }
  },
  "settings": {
      "analysis": {
        "analyzer": {
          "md_folded_analyzer": {
            "type": "lithuanian",
            "tokenizer": "standard",
            "filter":  [ 
              "lowercase", 
              "asciifolding",
              "lithuanian_stop",
              "lithuanian_keywords",
              "lithuanian_stemmer"
            ]
          }
        }
     }
  }
}

And the search query:

GET /cities/_search
{
  "query": {
    "multi_match" : {
      "type":     "most_fields",
      "query":    "klaipeda", 
      "fields": [ "name", "name.folded" ]
    }
  }
}

What am I doing wrong? Thanks for any help.


1 Answer


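The technique you are using here is called multi-fields. The limitation of the underlying name.folded field is that you cannot search against it; you can only sort on name.folded and aggregate on it.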

To work around this, I came up with the following setup:

  1. Set up a separate field (to avoid duplicating the value, just specify copy_to):

    curl -XPUT http://localhost:9200/cities -d '
    {
      "mappings": {
        "city": {
          "properties": {
            "name": {
              "type":     "string",
              "analyzer": "lithuanian",
              "copy_to": "folded"
            },
            "folded": {
              "type": "string",
              "analyzer": "md_folded_analyzer"
            }
          }
        }
      }
    }'
    
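  2. Change the type of the analyzer to custom; otherwise asciifolding never makes it into the configuration. More importantly, asciifolding should come after all the Lithuanian stemming and stop-word filters, because once a word has been folded it may lose the form those filters rely on.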

    curl -XPUT http://localhost:9200/my_cities -d '
    {
      "settings": {
          "analysis": {
            "filter": {
              "lithuanian_stop": {
                "type":       "stop",
                "stopwords":  "_lithuanian_"
              },
              "lithuanian_stemmer": {
                "type":       "stemmer",
                "language":   "lithuanian"
              }
            },
            "analyzer": {
              "md_folded_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter":  [
                  "lowercase",
                  "lithuanian_stop",
                  "lithuanian_stemmer",
                  "asciifolding"
                ]
              }
            }
         }
      }
    }'
    

    Sorry, I have dropped lithuanian_keywords; it needs additional setup that I have left out here. But I hope you get the idea.
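
The mapping can only reference md_folded_analyzer if the index settings define it, so the two steps are probably easiest to try as a single index-creation request. What follows is a sketch, not a tested setup. It assumes the same Elasticsearch version as the question (hence "type": "string" and the city mapping type), a node on localhost:9200, and cities as the index name. As above, lithuanian_keywords is left out.

    # Create the index with the analysis settings and the mapping in one request.
    curl -XPUT 'http://localhost:9200/cities' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "lithuanian_stop":    { "type": "stop",    "stopwords": "_lithuanian_" },
            "lithuanian_stemmer": { "type": "stemmer", "language":  "lithuanian" }
          },
          "analyzer": {
            "md_folded_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "lowercase", "lithuanian_stop", "lithuanian_stemmer", "asciifolding" ]
            }
          }
        }
      },
      "mappings": {
        "city": {
          "properties": {
            "name":   { "type": "string", "analyzer": "lithuanian", "copy_to": "folded" },
            "folded": { "type": "string", "analyzer": "md_folded_analyzer" }
          }
        }
      }
    }'

    # Index the example document from the question and refresh so it is searchable.
    curl -XPUT 'http://localhost:9200/cities/city/1?refresh=true' -d '{ "name": "Klaipėda" }'

    # Compare the tokens produced for the accented and the Latin-only spelling.
    curl -XGET 'http://localhost:9200/cities/_analyze' -d '
    { "analyzer": "md_folded_analyzer", "text": "Klaipėdoje Klaipedoje" }'

    # Search both fields with a Latin-only inflected form.
    curl -XGET 'http://localhost:9200/cities/_search' -d '
    {
      "query": {
        "multi_match": {
          "type":   "most_fields",
          "query":  "klaipedoje",
          "fields": [ "name", "folded" ]
        }
      }
    }'

If the _analyze call returns the same token for both spellings, the folded field should match the Latin-only query. If the tokens differ, the stemmer is treating the unaccented form differently and the filter order may need another look.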

Answered 2017-03-14T16:44:00.720