
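I am using Elasticsearch on a huge dataset of all Wikipedia article names; there are roughly 5 million of them. The database field name is articlenames.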

curl -XPUT "http://localhost:9200/index_wiki_articlenames/" -d'
{
   "settings":{
      "analysis":{
         "filter":{
            "nGram_filter":{
               "type":"edgeNGram",
               "min_gram":1,    
               "max_gram":20,
               "token_chars":[
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "tokenizer":{
            "edge_ngram_tokenizer":{
               "type":"edgeNGram",
               "min_gram":"1",
               "max_gram":"20",
               "token_chars":[
                  "letter",
                  "digit"
               ]
            }                                                                                                                   
         },
         "analyzer":{
            "nGram_analyzer":{
               "type":"custom",
               "tokenizer":"edge_ngram_tokenizer",
               "filter":[
                  "lowercase",
                  "asciifolding"
               ]
            },
            "whitespace_analyzer":{
               "type":"custom",
               "tokenizer":"whitespace",
               "filter":[
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings":{                                                                         
      "name":{
         "properties":{
            "articlenames":{
               "type":"text",
               "analyzer":"nGram_analyzer"
            }
         }
      }
   }
}'

I also tried the following links to solve my problem, but to no avail:

Edge NGram with phrase matching

https://hackernoon.com/elasticsearch-building-autocomplete-functionality-494fcf81a7cf

My goal is to get results like the following for the input query "sachin t":

sachin tendulkar
sachin tendulkar centuries
sachin tejas 
sachin top 60 quotes
sachin talwalkar
sachin tawade
sachin taps

and for the query "sachin te":

sachin tendulkar
sachin tendulkar centuries
sachin tejas 

and for the query "sachin ta":

sachin talwalkar
sachin tawade
sachin taps

and for the query "sachin ten":

sachin tendulkar
sachin tendulkar centuries

Keep in mind that the dataset is very large, and some article names and words may contain special characters, for example "Bronisław-Komorowski".

I can get output for smaller datasets of up to 100,000 records, but once the dataset grows to between 0.5 and 5 million records, I get no output.

My query is:

http://127.0.0.1:9200/index_wiki_articlenames/_search?&q=articlenames:sachin-t+articlenames:sachin-t.*&filter_path=hits.hits._source.articlenames&size=50

2 Answers


You should try the following settings:

curl -XPUT "http://localhost:9200/index_wiki_articlenames/" -d'
{
   "settings":{
      "analysis":{
         "tokenizer":{
            "edge_ngram_tokenizer":{
               "type":"edgeNGram",
               "min_gram":"1",
               "max_gram":"20",
               "token_chars":[
                  "letter",
                  "digit"
               ]
            }                                                                                                                   
         },
         "analyzer":{
            "nGram_analyzer":{
               "type":"custom",
               "tokenizer":"edge_ngram_tokenizer",
               "filter":[
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings":{                                                                         
      "name":{
         "properties":{
            "articlenames":{
               "type":"text",
               "analyzer":"nGram_analyzer",
               "search_analyzer": "standard"
            }
         }
      }
   }
}'
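
To see why an edge n-gram analyzer at index time pairs well with a plain analyzer at search time, here is a rough Python sketch of the two analysis chains. It only approximates what Elasticsearch does: the helper names (fold_ascii, index_tokens, search_tokens) are invented for illustration, and the accent folding is a simplification of the asciifolding filter.

```python
import re
import unicodedata


def fold_ascii(text):
    """Rough stand-in for asciifolding: strip combining accents.
    Letters with no Unicode decomposition (such as ł) are left untouched here."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_tokens(title, min_gram=1, max_gram=20):
    """Approximate nGram_analyzer: split on anything that is not a letter
    or digit, emit the edge n-grams of each word, lowercase, fold accents."""
    tokens = []
    for word in re.findall(r"[^\W_]+", title):
        for size in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(fold_ascii(word[:size].lower()))
    return tokens


def search_tokens(query):
    """Approximate the standard search analyzer: whole lowercase words only."""
    return [word.lower() for word in re.findall(r"[^\W_]+", query)]


print(index_tokens("Sachin Tendulkar"))
print(search_tokens("Sachin T"))
```

The point of the sketch is that the indexed title already contains every prefix of every word, so the query words can be looked up as they were typed, without being expanded into n-grams themselves. Note that the standard analyzer lowercases but does not fold accents, so an accented query word is compared as typed.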

Also try this query at search time:

GET index_wiki_articlenames/_search
{
  "query": {
    "match": {
      "articlenames": {
        "query": "Sachin T", 
        "operator": "and"
      }
    }
  }
}
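
As a sanity check on what the "and" operator should return, the following Python sketch mimics the matching on the sample titles from the question. It is a simplified model, not Elasticsearch itself: the function names are hypothetical, scoring is ignored, and a title is treated as indexed under every prefix of each of its words.

```python
import re

# Sample titles taken from the question.
TITLES = [
    "sachin tendulkar",
    "sachin tendulkar centuries",
    "sachin tejas",
    "sachin top 60 quotes",
    "sachin talwalkar",
    "sachin tawade",
    "sachin taps",
]


def words(text):
    """Lowercase words made of letters and digits only."""
    return [w.lower() for w in re.findall(r"[^\W_]+", text)]


def edge_ngrams(title, max_gram=20):
    """Set of prefixes (length 1..max_gram) of every word in the title."""
    return {w[:n] for w in words(title) for n in range(1, min(len(w), max_gram) + 1)}


def match_and(query, titles):
    """Keep the titles whose indexed prefixes contain every query word."""
    terms = words(query)
    return [t for t in titles if all(term in edge_ngrams(t) for term in terms)]


for query in ("sachin t", "sachin te", "sachin ta", "sachin ten"):
    print(query, "->", match_and(query, TITLES))
```

In this model each query reproduces the result lists asked for in the question. One caveat that the sketch makes visible: "and" only requires that every term is present, so word order and adjacency are not checked, and "tendulkar sa" matches the same titles as "sachin ten".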
Answered 2018-03-16T06:31:27.230

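I know this is late, but anyone looking for a solution can try this query. The mapping and index are correct; the query part seems to be missing the "and" operator.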

GET index_wiki_articlenames/_search
{
  "query": {
    "match": {
      "articlenames": {
        "query": "sachin ten", 
        "operator": "and"
      }
    }
  }
}

This results in:

sachin tendulkar
sachin tendulkar centuries
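
The difference the operator makes can be illustrated with a small Python model. A match query defaults to "or", which accepts a document if any query term is present. The code below is an illustrative sketch only: match and prefixes are hypothetical helpers, the max_gram limit of 20 is ignored because every sample word is shorter, and relevance ranking is left out.

```python
import re


def prefixes(title):
    """All word prefixes of a title, standing in for its indexed edge n-grams."""
    found = re.findall(r"[^\W_]+", title.lower())
    return {w[:n] for w in found for n in range(1, len(w) + 1)}


def match(query, titles, operator="or"):
    """Mimic a match query: 'or' needs any term present, 'and' needs all."""
    terms = re.findall(r"[^\W_]+", query.lower())
    combine = all if operator == "and" else any
    return [t for t in titles if combine(term in prefixes(t) for term in terms)]


titles = ["sachin tendulkar", "sachin tendulkar centuries", "sachin tejas", "sachin taps"]
print(match("sachin ten", titles))                  # default operator: every title
print(match("sachin ten", titles, operator="and"))  # only the tendulkar titles
```

With "or", the term "sachin" alone is enough for a hit, so every title is returned; with "and", only the titles that also contain a word starting with "ten" remain.
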
Answered 2020-10-03T03:13:24.970