I am using Elasticsearch on a huge dataset of all Wikipedia article names; there are about 5 million of them. The database field name is articlenames.
curl -XPUT "http://localhost:9200/index_wiki_articlenames/" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "name": {
      "properties": {
        "articlenames": {
          "type": "text",
          "analyzer": "nGram_analyzer"
        }
      }
    }
  }
}'
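For illustration, here is a rough Python approximation of the tokens I would expect nGram_analyzer to produce. It is only a sketch and not Elasticsearch itself: the helper edge_ngrams is hypothetical, it models just the letter/digit split of the tokenizer, the 1 to 20 gram range and the lowercase filter, and it leaves out asciifolding.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate nGram_analyzer (assumption: simplified model, no asciifolding).

    Characters that are not letters or digits end the current word, and every
    word is expanded into its leading prefixes of length min_gram..max_gram.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch.isdigit():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    tokens = []
    for word in words:
        longest = min(max_gram, len(word))
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n].lower())
    return tokens


if __name__ == "__main__":
    print(edge_ngrams("sachin t"))
    print(edge_ngrams("Bronisław-Komorowski"))
```

If the same analyzer is also applied to the search text (an assumption on my side, since the mapping sets no separate search_analyzer), a query such as "sachin t" would be reduced to the same short prefixes, for example s, sa and t.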
I also followed the link below to try to solve my problem, but in vain:
https://hackernoon.com/elasticsearch-building-autocomplete-functionality-494fcf81a7cf
My goal is to get results like the following for the input query "sachin t":
sachin tendulkar
sachin tendulkar centuries
sachin tejas
sachin top 60 quotes
sachin talwalkar
sachin tawade
sachin taps
and for the query "sachin te":
sachin tendulkar
sachin tendulkar centuries
sachin tejas
and for the query "sachin ta":
sachin talwalkar
sachin tawade
sachin taps
and for the query "sachin ten":
sachin tendulkar
sachin tendulkar centuries
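To make the expected behavior concrete, the lists above amount to a plain prefix match on the whole article name. A minimal Python sketch of that rule follows, using only the sample names from this question; the function name expected_suggestions is hypothetical and this is not how Elasticsearch computes its results.

```python
# Sample article names taken from the expected results in this question.
TITLES = [
    "sachin tendulkar",
    "sachin tendulkar centuries",
    "sachin tejas",
    "sachin top 60 quotes",
    "sachin talwalkar",
    "sachin tawade",
    "sachin taps",
]


def expected_suggestions(query, titles=TITLES):
    """Return the titles that start with the typed text, ignoring case."""
    prefix = query.lower()
    return [title for title in titles if title.lower().startswith(prefix)]


if __name__ == "__main__":
    for typed in ("sachin t", "sachin te", "sachin ta"):
        print(typed, "->", expected_suggestions(typed))
```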
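Keep in mind that the dataset is very large, and some article names and words may contain special characters, for example "Bronisław-Komorowski".
I am able to get the output for smaller datasets of up to 100K records, but as soon as my dataset grows to between 0.5 and 5 million records, I can no longer get the output.
My query is: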
http://127.0.0.1:9200/index_wiki_articlenames/_search?&q=articlenames:sachin-t+articlenames:sachin-t.*&filter_path=hits.hits._source.articlenames&size=50
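For reference, this is roughly how the same URI search could be assembled in Python with percent-encoding applied to the query string. It is a sketch under assumptions: build_search_url is a hypothetical helper, and replacing the space in the typed text with a hyphen simply mirrors what I did by hand in the URL above.

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:9200/index_wiki_articlenames/_search"


def build_search_url(text, size=50):
    """Build the URI-search URL used in this question for the typed text."""
    # Assumption: spaces are replaced by hyphens, as in the hand-written URL.
    term = text.replace(" ", "-")
    params = {
        "q": f"articlenames:{term} articlenames:{term}.*",
        "filter_path": "hits.hits._source.articlenames",
        "size": size,
    }
    # urlencode percent-encodes ':' and '*' and turns the space into '+'.
    return f"{BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("sachin t"))
```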