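I'm using Elasticsearch to implement a feature in a production environment. My problem is that I need both fuzzy search and phonetic matching to reach my goal, as shown below:
- Using a fuzzy matching query: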
GET _search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"type": "most_fields",
"query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
"fuzzy_transpositions": "true",
"fuzziness": "AUTO",
"fields": ["artist_name", "title_track"],
"slop": 100,
"max_expansions": 30
}
},
{
"multi_match": {
"type": "cross_fields",
"query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
"fields": ["artist_name", "title_track"],
"boost": 5,
"operator": "and",
"max_expansions": 30
}
}]
}
}
}
- The results are very good, even with a messed-up query string:
{
"took": 316,
"timed_out": false,
"_shards": {
"total": 11,
"successful": 11,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1169343,
"max_score": 26.201363,
"hits": [
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "zVzFm2gB0djhmNXkB5y-",
"_score": 26.201363,
"_source": {
"title_track": "HEY JUDE",
"album_id": null,
"artist_id": 38387,
"artist_name": """"BEATLES, THE""""
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "X1ETmmgB0djhmNXkARTQ",
"_score": 26.201363,
"_source": {
"title_track": "HEY JUDE",
"album_id": null,
"artist_id": 21183,
"artist_name": "THE BEATLES"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "MF34m2gB0djhmNXkTvIn",
"_score": 26.080318,
"_source": {
"title_track": "HEY JUDE",
"album_id": 6135978,
"artist_id": 40333,
"artist_name": "BEATLES, THE"
}
},
...
- The problem starts when the artist and/or track is not in the index:
GET _search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"type": "most_fields",
"query": "justin bieber - sorry",
"fuzzy_transpositions": "true",
"fuzziness": "AUTO",
"fields": ["artist_name", "title_track"],
"slop": 100,
"max_expansions": 30
}
},
{
"multi_match": {
"type": "cross_fields",
"query": "justin bieber - sorry",
"fields": ["artist_name", "title_track"],
"boost": 5,
"operator": "and",
"max_expansions": 30
}
}]
}
}
}
- The results don't include Justin Bieber, because he isn't indexed:
{
"took": 121,
"timed_out": false,
"_shards": {
"total": 11,
"successful": 11,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 19730,
"max_score": 24.51635,
"hits": [
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "-XfOn2gB0djhmNXkENiE",
"_score": 24.51635,
"_source": {
"title_track": "JUSTIN",
"album_id": 5897467,
"artist_id": 117964,
"artist_name": "JUSTIN"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "yXfOn2gB0djhmNXkCdjW",
"_score": 24.42126,
"_source": {
"title_track": "JUSTIN",
"album_id": null,
"artist_id": 117964,
"artist_name": "JUSTIN"
}
},
{
"_index": "repmatch",
"_type": "repertoire",
"_id": "iDxal2gB0djhmNXkY_ew",
"_score": 23.26923,
"_source": {
"title_track": "JUSTIN BIEBER",
"album_id": null,
"artist_id": 10851,
"artist_name": "SMASH MOUTH"
}
},
...
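Both requests above share the same body and differ only in the query string. As a minimal sketch, assuming the search is issued from Python, the body could be built by a small helper. The function name `build_query` is hypothetical. The field names and parameters are copied from the requests above.

```python
import json


def build_query(text, fields=("artist_name", "title_track")):
    """Return the bool/should search body shown above for an arbitrary input string."""
    field_list = list(fields)
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        # Fuzzy clause: tolerates typos in individual terms.
                        "multi_match": {
                            "type": "most_fields",
                            "query": text,
                            "fuzzy_transpositions": True,
                            "fuzziness": "AUTO",
                            "fields": field_list,
                            "slop": 100,
                            "max_expansions": 30,
                        }
                    },
                    {
                        # Exact clause: boosted when all terms are found across the fields.
                        "multi_match": {
                            "type": "cross_fields",
                            "query": text,
                            "fields": field_list,
                            "boost": 5,
                            "operator": "and",
                            "max_expansions": 30,
                        }
                    },
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted after "GET _search".
    print(json.dumps(build_query("justin bieber - sorry"), indent=2))
```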
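The goal is to know whether the artist and track are indexed. I need the most accurate results possible, while still using fuzziness to tolerate typos.
My idea is to use the phonetic plugin with metaphone to post-process the retrieved documents and the input string, so that I can determine whether the metaphone generated for a document is present in the metaphone of the input string. I was hoping I could send a single query and have Elasticsearch return all of this in the same result set, or even tell me whether a match was found.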
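To make the post-processing idea concrete, here is a rough sketch in Python, outside Elasticsearch. It uses a plain Soundex encoder as a stand-in for metaphone, so the keys are not what the phonetic plugin would produce. The names `soundex`, `phonetic_keys`, and `is_phonetic_match` are hypothetical. A hit counts as a match when every token of `artist_name` and `title_track` has a phonetic key that also occurs in the input string.

```python
import re

# Standard American Soundex digit groups (a stand-in for metaphone).
_SOUNDEX_CODES = {
    ch: digit
    for digit, letters in (
        ("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
        ("4", "l"), ("5", "mn"), ("6", "r"),
    )
    for ch in letters
}


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no ASCII letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (key + "000")[:4]


def phonetic_keys(text):
    """Set of Soundex keys for every alphabetic token in the text."""
    return {soundex(token) for token in re.findall(r"[A-Za-z]+", text)}


def is_phonetic_match(query_text, source):
    """True if every artist/title token sounds like some token of the query."""
    query_keys = phonetic_keys(query_text)
    doc_keys = phonetic_keys(source["artist_name"]) | phonetic_keys(source["title_track"])
    return bool(doc_keys) and doc_keys <= query_keys


if __name__ == "__main__":
    beatles = {"artist_name": "THE BEATLES", "title_track": "HEY JUDE"}
    smash = {"artist_name": "SMASH MOUTH", "title_track": "JUSTIN BIEBER"}
    print(is_phonetic_match("The Beatles - hey jode", beatles))
    print(is_phonetic_match("justin bieber - sorry", smash))
```

One limitation of checking only the document side: a hit whose artist and title are both just "JUSTIN" would still pass for the input "justin bieber - sorry", because its single key is present in the input.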
So far, all I have managed to do is run the phonetic analyzer on a string:
GET phonetic/_analyze
{
"analyzer": "dbl_metaphone",
"text": "The Beatles – Hello Goodbye"
}
or
GET /phonetic/phonetic/_search
{
"query": {
"match": {
"user.phonetic": {
"query":"beatles"
}
}
}
}
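This is far from what I need, because I can't use phonetic and fuzzy search on the same field.
Here is how the phonetic analyzer and filter were created: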
PUT /phonetic
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
}
PUT /phonetic/_mapping/phonetic
{
"properties": {
"user": {
"type": "text",
"fields": {
"phonetic": {
"type": "text",
"analyzer": "dbl_metaphone"
}
}
}
}
}
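I haven't found more detailed material on the phonetic plugin for Elasticsearch, or on how to use it from a script, for example (the idea in that case would be to post-process each document, generate the phonetic encoding of each token, and then compare those with the search string).
I could write an external program that receives and processes the Elasticsearch results, but that is clunky, because I would then have two APIs, one calling the other (I still need to serve the results through an API).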
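For reference, the external post-processing step could be as small as the sketch below. It assumes the search response has already been parsed into a Python dict with the same shape as the results above. The function `annotate_hits`, the `matched` flag, and the `cutoff` value are hypothetical. Typo tolerance comes from `difflib` similarity, which is a stand-in and does not behave exactly like Elasticsearch's `fuzziness`.

```python
import re
from difflib import get_close_matches


def tokens(text):
    """Lowercased alphanumeric tokens; punctuation such as '$' or '-' is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def annotate_hits(query_text, response, cutoff=0.7):
    """Return the hits with an extra 'matched' flag.

    A hit is flagged when every token of artist_name and title_track has a
    sufficiently similar token in the query text.
    """
    query_tokens = tokens(query_text)
    annotated = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        doc_tokens = tokens(source["artist_name"]) + tokens(source["title_track"])
        matched = bool(doc_tokens) and all(
            get_close_matches(token, query_tokens, n=1, cutoff=cutoff)
            for token in doc_tokens
        )
        annotated.append({**hit, "matched": matched})
    return annotated


if __name__ == "__main__":
    response = {
        "hits": {
            "hits": [
                {"_id": "a", "_source": {"artist_name": "THE BEATLES", "title_track": "HEY JUDE"}},
                {"_id": "b", "_source": {"artist_name": "SMASH MOUTH", "title_track": "JUSTIN BIEBER"}},
            ]
        }
    }
    query = "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014"
    for hit in annotate_hits(query, response):
        print(hit["_id"], hit["matched"])
```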
In short, I need to be sure the artist and track are indexed, while still accepting typos.
Thanks in advance.