python - Number of results from elasticsearch fuzzy_like_this_field influence "good" results being returned

Question

Sorry for the lousy title, but let me explain the problem I'm having. I'm working on a project that includes a search engine for addresses, which I have indexed in Elasticsearch. Each time a new character is entered in my search bar, I run a fuzzy_like_this_field query to generate autocomplete results and try to "guess" which of the (~1 million) addresses the user is typing.

I currently have a size limit on my query, since returning all of the results is both unnecessary and expensive, time-wise. My issue is that I often don't get the "correct" result unless I return 1000 or more results from the query. For example, if I enter "100 broad" while searching for "100 broadway" and only return 200 results (about the most I can return without it taking too long), "100 broadway" is nowhere to be found, even though every returned result has a higher Levenshtein distance than the result I want. I get "100 broadway" as the first result if I return 2000 results from my query, but that takes too long. I can't even re-rank the returned results to bring the correct one to the top, because it isn't being returned at all.

Shouldn't putting a size limit of N on the query return the best N results, not a seemingly random subset of them?

Sorry if this is poorly worded or too vague.

score 0 · Accepted Answer

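I think you may have some misconceptions about the fuzzy_like_this query. From the documentation:

"Fuzzifies ALL terms provided as strings and then picks the best n differentiating terms... For each source term the fuzzy variants are held in a BooleanQuery with no coord factor..."

If you only want a fuzzy search based on Levenshtein distance, use the fuzzy query instead.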
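As a rough sketch of what such a fuzzy query body could look like, here it is built as a plain Python dict. The helper name `build_fuzzy_query`, the field name `address`, and the misspelled term `brodway` are assumptions for illustration; the body would still need to be sent with whatever client you use. Keep in mind that the fuzzy query is a term-level query (the search text is not analyzed, so it is compared against single indexed terms) and that Elasticsearch caps fuzziness at an edit distance of 2, so a prefix such as "broad" is too far from "broadway" (distance 3) for fuzziness alone to bridge.

```python
import json


def build_fuzzy_query(field, term, fuzziness="AUTO", size=10):
    """Build a term-level fuzzy query body (hypothetical helper).

    `fuzziness` is the maximum Levenshtein edit distance: 0, 1, 2 or "AUTO".
    """
    return {
        "size": size,
        "query": {
            "fuzzy": {
                field: {
                    "value": term,
                    "fuzziness": fuzziness,
                }
            }
        },
    }


# "brodway" is one edit away from "broadway", so fuzziness=1 is enough.
body = build_fuzzy_query("address", "brodway", fuzziness=1)
print(json.dumps(body, indent=2))
```

Because the query works on one term at a time, multi-word input such as "100 broad" would have to be split into terms first, or handled with a different approach.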

score 0 · Accepted Answer

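You can write a custom analyzer that uses the edge ngram tokenizer, which will help you achieve what you need. The technique is demonstrated in the Elasticsearch guide here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html

Then a simple query like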
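To make the idea concrete, here is a minimal sketch. The `edge_ngrams` function is a pure-Python approximation of what edge ngram analysis emits for one token, and `index_settings` is one possible index configuration written as a Python dict. The names `autocomplete` and `autocomplete_filter` and the gram sizes are assumptions, and the sketch uses an `edge_ngram` token filter on top of the `standard` tokenizer rather than the edge ngram tokenizer itself.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of `token` that edge ngram analysis would index."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumed settings: names and gram sizes are illustrative, not prescribed.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    }
}

print(edge_ngrams("broadway"))
# ['b', 'br', 'bro', 'broa', 'broad', 'broadw', 'broadwa', 'broadway']
```

Since "broad" is itself one of the indexed terms for "broadway", a partially typed word can match exactly, with no fuzziness involved.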

{
    "query": {
        "match": {
            "address": "100 Broadway"
        }
    }
}

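will do the job. You may also want to consider using a different analyzer at search time, which the tutorial also shows (at the end). This lets you do things such as tokenizing your search query and preprocessing it differently from the way text is analyzed at index time.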
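The reason a separate search-time analyzer matters can be sketched in pure Python. The code below approximates the two analysis chains with whitespace splitting and lowercasing (a simplification of the standard tokenizer), and the mapping dict shows one way the two analyzers might be attached to a field. The analyzer name `autocomplete`, the field name `address`, and the `text` field type are assumptions; older Elasticsearch versions use different mapping keys.

```python
def index_terms(text, min_gram=1, max_gram=20):
    """Approximate index-time analysis: lowercase, split, emit edge ngrams."""
    terms = set()
    for token in text.lower().split():
        for n in range(min_gram, min(len(token), max_gram) + 1):
            terms.add(token[:n])
    return terms


def search_terms(text):
    """Approximate search-time analysis: lowercase and split only."""
    return set(text.lower().split())


# Assumed mapping: ngrams at index time, plain tokens at search time.
address_mapping = {
    "properties": {
        "address": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "standard",
        }
    }
}

query = search_terms("100 broad")
print(query <= index_terms("100 Broadway"))  # True: both terms are indexed
print(query <= index_terms("1 Bridge St"))   # False: no "100", no "broad"

# If the query were ngrammed too, unrelated addresses would share terms:
print(sorted(index_terms("100 broad") & index_terms("1 Bridge St")))
# ['1', 'b', 'br']
```

With the query left un-ngrammed, only addresses that contain every typed prefix as an indexed term can match, which keeps short prefixes like "1" or "b" from pulling in unrelated results.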
