lucene - Exact document match with ElasticSearch

Question

I need to query a set of "short documents" accurately. Example:

Documents:

{"name": "John Doe", "alt": "John W Doe"}
{"name": "My friend John Doe", "alt": "John A Doe"}
{"name": "John", "alt": "Susy"}
{"name": "Jack", "alt": "John Doe"}

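Expected results:

If I search for "John Doe", I want document 1 to score much higher than documents 2 and 4
If I search for "John Doé", same as above
If I search for "John", I want to get document 3 first (a full match is better than having the name twice, in name and alt)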

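Is this possible with ES? How can I achieve it? I tried boosting "name", but I can't find how to match a document field exactly, instead of searching within it.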

score 5 · Accepted Answer

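What you are describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for docs that contain the term, then assigns a _score to each doc, based on:

how common the term is across all docs (more common == less relevant)
how common the term is within the doc's field (more common == more relevant)
how long the doc's field is (longer == less relevant)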
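
To make these three factors concrete, here is a rough Python sketch using the classic Lucene TF/IDF formulas as I understand them (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), field norm = 1 / sqrt(numTerms)). It is a simplification: it ignores query normalisation, the coord factor and the lossy encoding of the field norm, so the absolute numbers will not match what Elasticsearch reports; only the ordering is meant to be illustrative. The helper names tokenize, score and rank are made up for this example.

```python
import math

# The "name" field of the four docs from the question, keyed by doc ID.
docs = {"1": "John Doe", "2": "My friend John Doe", "3": "John", "4": "Jack"}


def tokenize(text):
    # Crude stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def score(query, field_text, all_texts):
    terms = tokenize(field_text)
    if not terms:
        return 0.0
    # Longer field == less relevant.
    norm = 1.0 / math.sqrt(len(terms))
    total = 0.0
    for term in tokenize(query):
        freq = terms.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for t in all_texts if term in tokenize(t))
        # Term common across all docs == less relevant.
        idf = 1.0 + math.log(len(all_texts) / (doc_freq + 1))
        # Term common within this field == more relevant.
        tf = math.sqrt(freq)
        total += tf * idf ** 2 * norm
    return total


def rank(query):
    texts = list(docs.values())
    scored = {doc_id: score(query, text, texts) for doc_id, text in docs.items()}
    ordered = sorted(scored.items(), key=lambda item: -item[1])
    # Docs with no matching term are not hits at all.
    return [doc_id for doc_id, value in ordered if value > 0]


print(rank("john doe"))
print(rank("john"))
```

With global frequencies, the shorter "John Doe" field outranks "My friend John Doe", and for the single term "john" the one-word field of doc 3 comes out on top.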

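The reason you are not seeing clear results is that Elasticsearch is distributed and you are testing with a small amount of data. By default an index has 5 primary shards, and your docs are indexed on different shards. Each shard keeps its own doc frequency counts, so the scores are distorted.

Once you add real-world amounts of data, the frequencies even out across shards, but to test with a small amount of data you need to do one of two things:

create an index with just one primary shard, or
specify search_type=dfs_query_then_fetch, which first fetches the frequencies from each shard and then runs the query using the global frequencies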
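
A toy illustration of why per-shard frequencies skew scores on tiny data sets. The assignment of the four docs to three shards below is invented for the example (Elasticsearch routes docs by a hash of the ID), and idf is assumed to follow the classic Lucene formula:

```python
import math


def idf(term, texts):
    # Classic Lucene-style inverse document frequency (assumed formula).
    doc_freq = sum(1 for text in texts if term in text.lower().split())
    return 1.0 + math.log(len(texts) / (doc_freq + 1))


# Hypothetical distribution of the "name" values over three shards.
shards = [["John Doe", "Jack"], ["My friend John Doe"], ["John"]]

# What each shard computes on its own (plain query_then_fetch).
local_idfs = [idf("john", shard) for shard in shards]

# What a dfs_query_then_fetch-style global count would give.
global_idf = idf("john", [text for shard in shards for text in shard])

print(local_idfs)
print(global_idf)
```

The same term ends up with a different weight depending on which shard a doc happens to live on, which is why the ranking looks arbitrary until the frequencies are collected globally or the index has a single shard.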

To demonstrate, first index your data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now, search for "john doe", remembering to specify dfs_query_then_fetch:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doe"
      }
   }
}
'

Doc 1 is first in the results:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 8
# }

And when you search for "john":

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john"
      }
   }
}
'

Doc 3 appears first:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 1,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 0.625,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.5,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

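Ignoring accents

The second issue is matching "John Doé". This is a question of analysis. To make full text more searchable, we analyse it into separate terms or tokens, which are what get stored in the index. So that john, John and JOHN all match when the user searches for john, each term/token is passed through a number of token filters that put it into a standard form.

When we run a full text search, the search terms go through exactly the same process. So if we have a doc which contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.

To make Doé match doe, we need a token filter which removes accents, and we need to apply it both to the text being indexed and to the search terms. The simplest way to do this is to use the ASCII folding token filter.

We can define a custom analyzer when we create an index, and we can specify in the mapping that a particular field should use that analyzer at both index time and search time.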
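
As a rough picture of what such an analysis chain does, here is a small Python approximation of "lowercase, then fold accents". It is not the real asciifolding filter: it only strips combining marks after Unicode decomposition, so characters without a decomposition (for example ø) are left untouched. The function names fold_accents and analyze are made up for this sketch.

```python
import unicodedata


def fold_accents(token):
    # Decompose accented characters (é becomes e + combining acute accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    # Whitespace tokenizer -> lowercase filter -> accent folding.
    return [fold_accents(token.lower()) for token in text.split()]


# The same chain is applied to the indexed text and to the search terms,
# so both sides end up as identical tokens.
print(analyze("John Doé"))
print(analyze("JOHN DOE"))
```

Because both the stored value and the query pass through the same function, "Doé", "doe" and "DOE" all reduce to the same term.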

First, delete the old index:

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'

Then create the index, specifying the custom analyzer and the mapping:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_accents" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "asciifolding"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   },
   "mappings" : {
      "test" : {
         "properties" : {
            "name" : {
               "type" : "string",
               "analyzer" : "no_accents"
            }
         }
      }
   }
}
'

Reindex the data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now, test the search:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doé"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 6
# }

score 2 · Accepted Answer

I think you will achieve what you need if you map the field as a multi-field and boost the not-analyzed sub-field:

 "name": {
            "type": "multi_field",
            "fields": {
                "untouched": {
                    "type": "string",
                    "index": "not_analyzed",
                    "boost": "1.1"
                },
                "name": {
                    "include_in_all": true,
                    "type": "string",
                    "index": "analyzed",
                    "search_analyzer": "someanalyzer",
                    "index_analyzer": "someanalyzer"
                }
            }
        }
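
To show why the untouched sub-field helps with exact matches, here is a small Python sketch of the difference between the two sub-fields. It assumes a not_analyzed field stores the whole value as a single term (so matching is exact and case-sensitive) and that the analyzed field is simply lowercased and split on whitespace; both function names are invented for the illustration.

```python
def analyzed_match(query, value):
    # Analyzed field: the doc matches if it shares any lowercased term
    # with the query.
    return bool(set(query.lower().split()) & set(value.lower().split()))


def untouched_match(query, value):
    # not_analyzed field: the whole value is one term, so only an
    # identical string matches.
    return query == value


for name in ["John", "John Doe", "My friend John Doe"]:
    print(name, analyzed_match("John", name), untouched_match("John", name))
```

All three names match on the analyzed sub-field, but only the doc whose name is exactly "John" also matches on the untouched one, and that extra boosted match is what pushes it to the top.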

If you need flexibility, you can also boost at query time rather than at index time by using the '^' notation in a query_string:

{
    "query_string" : {
        "fields" : ["name", "name.untouched^5"],
        "query" : "this AND that OR thus"
    }
}
