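I have a file of alternate spellings for terms in my index, and I want to generate bigrams that include the alternate spellings of a given term. For example, my alternate-spellings CSV file contains the line "biriyani, biryani, briyani" and my field contains the text "Chicken Biryani". I want to be able to produce the tokens: chicken biryani, chicken biriyani, chicken briyani.
If I use the whitespace tokenizer with a synonym filter, it generates the following tokens, as expected: chicken, biriyani, biryani, briyani.
If I then apply a shingle filter, the tokens generated are: chicken, chicken biryani, biryani, biryani biriyani, biriyani, biriyani briyani, briyani.
This token stream contains shingles built from the synonyms of a single word (such as biryani biriyani), which should not be there, and it does not contain shingles of the form chicken [alternate spellings of biryani], such as chicken biriyani or chicken briyani.
If I put the shingle filter before the synonym filter, the synonyms are only added as unigrams: chicken, chicken biryani, biriyani, biryani, briyani.
Is there a way to generate shingles in which the synonyms are treated as being at the same position as the original token, so that in this case the output includes chicken biryani, chicken biriyani, chicken briyani?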
Example settings for testing:
PUT test_bigram
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "biriyani, biryani, briyani"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "filter": [
              "synonym"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          },
          "shingle_synonym": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "shingle",
              "synonym"
            ]
          },
          "synonym_shingle": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "synonym",
              "shingle"
            ]
          }
        }
      }
    }
  }
}
I am running Elasticsearch 5.6.
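To make the desired behavior concrete, here is a rough Python sketch of the bigrams I am after. This is not Elasticsearch code: positional_bigrams is a hypothetical helper written only for illustration, and it assumes each token carries a position and that all synonyms of a word share that word's position. A bigram is then formed only between a token at position p and a token at position p + 1, never between two tokens at the same position.

```python
from collections import defaultdict
from itertools import product


def positional_bigrams(tokens):
    """Build bigrams from (term, position) pairs.

    Hypothetical illustration only: synonyms are assumed to share the
    position of the original term, so they are never paired with each other.
    """
    # Group the terms by the position they occupy in the token stream.
    by_position = defaultdict(list)
    for term, position in tokens:
        by_position[position].append(term)

    bigrams = []
    for position in sorted(by_position):
        next_terms = by_position.get(position + 1)
        if not next_terms:
            continue
        # Pair every term at this position with every term at the next one.
        for left, right in product(by_position[position], next_terms):
            bigrams.append(f"{left} {right}")
    return bigrams


# "chicken" is at position 0; the three spellings all sit at position 1.
tokens = [
    ("chicken", 0),
    ("biriyani", 1),
    ("biryani", 1),
    ("briyani", 1),
]
print(positional_bigrams(tokens))
```

With the tokens above, the sketch yields only the chicken + spelling pairs and no pairs such as biryani biriyani, which is the output I would like the analyzer to produce.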