
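I am using Elasticsearch to implement a feature in production. My problem is that I need to combine fuzzy search with phonetic matching to reach my goal, as described below: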

  • Query using fuzzy matching:
GET _search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "type": "most_fields", 
            "query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
            "fuzzy_transpositions": "true", 
            "fuzziness": "AUTO", 
            "fields": ["artist_name", "title_track"],
            "slop": 100,
            "max_expansions": 30
          }
        },
        {
          "multi_match": {
            "type": "cross_fields", 
            "query": "MUSIC: DOWNLOAD The Beatle$ – hey jode -FLAC-WEB- CDQ-2014",
            "fields": ["artist_name", "title_track"],
            "boost": 5, 
            "operator": "and",
            "max_expansions": 30
          }
        }
      ]
    }
  }
}
  • The results are very good, even though the query string is badly mangled:
{
  "took": 316,
  "timed_out": false,
  "_shards": {
    "total": 11,
    "successful": 11,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1169343,
    "max_score": 26.201363,
    "hits": [
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "zVzFm2gB0djhmNXkB5y-",
        "_score": 26.201363,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": null,
          "artist_id": 38387,
          "artist_name": """"BEATLES, THE""""
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "X1ETmmgB0djhmNXkARTQ",
        "_score": 26.201363,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": null,
          "artist_id": 21183,
          "artist_name": "THE  BEATLES"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "MF34m2gB0djhmNXkTvIn",
        "_score": 26.080318,
        "_source": {
          "title_track": "HEY JUDE",
          "album_id": 6135978,
          "artist_id": 40333,
          "artist_name": "BEATLES, THE"
        }
      },
...

  • The problem starts when the artist and/or the track is not in the index:
GET _search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "type": "most_fields", 
            "query": "justin bieber - sorry",
            "fuzzy_transpositions": "true", 
            "fuzziness": "AUTO", 
            "fields": ["artist_name", "title_track"],
            "slop": 100,
            "max_expansions": 30
          }
        },
        {
          "multi_match": {
            "type": "cross_fields", 
            "query": "justin bieber - sorry",
            "fields": ["artist_name", "title_track"],
            "boost": 5, 
            "operator": "and",
            "max_expansions": 30
          }
        }
      ]
    }
  }
}
  • The results do not contain Justin Bieber, because he is not indexed; unrelated documents score highly instead:
{
  "took": 121,
  "timed_out": false,
  "_shards": {
    "total": 11,
    "successful": 11,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 19730,
    "max_score": 24.51635,
    "hits": [
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "-XfOn2gB0djhmNXkENiE",
        "_score": 24.51635,
        "_source": {
          "title_track": "JUSTIN",
          "album_id": 5897467,
          "artist_id": 117964,
          "artist_name": "JUSTIN"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "yXfOn2gB0djhmNXkCdjW",
        "_score": 24.42126,
        "_source": {
          "title_track": "JUSTIN",
          "album_id": null,
          "artist_id": 117964,
          "artist_name": "JUSTIN"
        }
      },
      {
        "_index": "repmatch",
        "_type": "repertoire",
        "_id": "iDxal2gB0djhmNXkY_ew",
        "_score": 23.26923,
        "_source": {
          "title_track": "JUSTIN BIEBER",
          "album_id": null,
          "artist_id": 10851,
          "artist_name": "SMASH MOUTH"
        }
      },
...

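The goal is to know whether the artist and the track are actually indexed. I need results that are as accurate as possible, while still using fuzziness to absorb spelling mistakes.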

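My idea is to use the phonetic plugin with metaphone to post-process the retrieved documents and the input string, so that I can decide whether the metaphone codes generated for a document are present among the metaphone codes of the input string. Ideally I could send a single query and have Elasticsearch return all of this information in the same result set, or even tell me directly whether a match was found.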
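
The post-processing idea above can be sketched outside Elasticsearch as follows. This is only an illustration under stated assumptions: it uses a small hand-written American Soundex encoder as a stand-in for double metaphone (the codes the phonetic plugin produces will differ), and the helper names `soundex`, `phonetic_keys` and `is_indexed_match` are hypothetical. A hit counts as a match only if the phonetic keys of its artist and title can all be found, with multiplicity, among the keys of the input string.

```python
import re
from collections import Counter

# Soundex digit per consonant; vowels, H, W and Y carry no digit.
# NOTE: Soundex is an assumption used here in place of double metaphone.
_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
    "4": "L", "5": "MN", "6": "R"}.items() for c in letters}


def soundex(word):
    """Return the 4-character American Soundex key of an alphabetic word."""
    word = re.sub(r"[^A-Z]", "", word.upper())
    if not word:
        return ""
    key = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def phonetic_keys(text):
    """Count the Soundex keys of the alphabetic tokens in text."""
    return Counter(soundex(t) for t in re.findall(r"[A-Za-z]+", text))


def is_indexed_match(query, artist_name, title_track):
    """True if every key of the hit's artist and title occurs in the query."""
    needed = phonetic_keys(artist_name) + phonetic_keys(title_track)
    missing = needed - phonetic_keys(query)  # keeps only positive counts
    return bool(needed) and not missing
```

With this check, a hit such as artist "JUSTIN" / title "JUSTIN" is rejected for the input "justin bieber - sorry", because the input contains that key only once while the document needs it twice. The comparison ignores token order, so "BEATLES, THE" and "THE  BEATLES" are treated alike.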

So far I have only managed to run the phonetic analyzer on a string:

GET phonetic/_analyze
{
  "analyzer": "dbl_metaphone",
  "text": "The Beatles – Hello Goodbye"
} 

or to search a phonetic sub-field:

GET /phonetic/phonetic/_search
{
    "query": {
        "match": {
            "user.phonetic": {
                "query":"beatles"
            }
        }
    }
}

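This is far from what I need, because I cannot use phonetic and fuzzy search on the same field.

Here is how the phonetic analyzer and filter were created: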

PUT /phonetic
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type":    "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter":    "dbl_metaphone"
        }
      }
    }
  }
}

PUT /phonetic/_mapping/phonetic
{
  "properties": {
    "user": {
      "type": "text",
      "fields": {
        "phonetic": {
          "type":     "text",
          "analyzer": "dbl_metaphone"
        }
      }
    }
  }
}
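
For illustration, one way to combine both kinds of matching in a single request is to query the plain field with fuzziness and the phonetic sub-field without it, inside one `bool` query. The sketch below builds such a request body as a Python dict. It assumes that `artist_name` and `title_track` each have a `.phonetic` sub-field mapped like `user.phonetic` above (these sub-fields are hypothetical and do not appear in the original index), and it labels each clause with `_name` so that every hit reports the clauses it satisfied in `matched_queries`. The function names are also hypothetical.

```python
import json


def build_query(text, fields=("artist_name", "title_track")):
    """Build a bool query with one fuzzy and one phonetic clause per field."""
    should = []
    for field in fields:
        # Fuzzy clause on the plain field tolerates small typos.
        should.append({"match": {field: {
            "query": text,
            "fuzziness": "AUTO",
            "_name": f"{field}:fuzzy",
        }}})
        # Clause on the (assumed) phonetic sub-field matches sound-alikes.
        should.append({"match": {f"{field}.phonetic": {
            "query": text,
            "_name": f"{field}:phonetic",
        }}})
    return {"query": {"bool": {"should": should}}}


def matched_fields(hit):
    """Return the fields for which at least one named clause matched a hit."""
    return {name.split(":")[0] for name in hit.get("matched_queries", [])}


if __name__ == "__main__":
    # Print the body so it can be pasted into a console request.
    print(json.dumps(build_query("justin bieber - sorry"), indent=2))
```

A named clause only shows that at least one token of the input matched that field, not that the whole artist or title is present, so `matched_queries` is a signal rather than proof that both are indexed.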

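I have not found more detailed material on the phonetic plugin for Elasticsearch, for example on how to use it from a script. The idea here would be to post-process each document, generate the phonetic code for each of its tokens, and then compare those codes with the ones from the search string.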

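I could write an external program that receives and processes the Elasticsearch results, but that feels clumsy, because I would then have two APIs, one calling the other (I still need to serve the results through an API).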

In short, I need to be sure that both the artist and the track are indexed, while still tolerating spelling mistakes.

Thanks in advance.
