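I have indexed 2 million documents, and I am trying to return all matching document IDs in a single request. I am using the PHP client.
My index settings and mapping are as follows: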
$params = [
    'index' => $index,
    'body' => [
        'settings' => [
            "number_of_shards" => 1,
            "number_of_replicas" => 0,
            "index.queries.cache.enabled" => false,
            "index.soft_deletes.enabled" => false,
            "index.refresh_interval" => -1,
            "index.requests.cache.enable" => false,
            "index.max_result_window" => $result_window
        ],
        'mappings' => [
            '_source' => [
                "enabled" => false
            ],
            'properties' => [
                "text" => [
                    "type" => "text",
                    "index_options" => "docs"
                ]
            ]
        ]
    ]
];
My query string is as follows:
$json = '{
    "from" : 0, "size" : '.$size.',
    "profile": true,
    "query": {
        "bool": {
            "filter" : {
                "match" : {
                    "text" : {
                        "query" : "justin trump clinton harry",
                        "operator" : "and"
                    }
                }
            }
        }
    }
}';
My Profile API output is as follows:
Array
(
[shards] => Array
(
[0] => Array
(
[id] => [tod2gbVKSRGinZVfdXTmxA][elasticindex-2][0]
[searches] => Array
(
[0] => Array
(
[query] => Array
(
[0] => Array
(
[type] => BoostQuery
[description] => (ConstantScore(+text:justin +text:trump +text:clinton +text:harry))^0.0
[time_in_nanos] => 176108294
[breakdown] => Array
(
[set_min_competitive_score_count] => 0
[match_count] => 0
[shallow_advance_count] => 0
[set_min_competitive_score] => 0
[next_doc] => 158666901
[match] => 0
[next_doc_count] => 439522
[score_count] => 439522
[compute_max_score_count] => 0
[compute_max_score] => 0
[advance] => 262234
[advance_count] => 1
[score] => 14477781
[build_scorer_count] => 2
[create_weight] => 401058
[shallow_advance] => 0
[create_weight_count] => 1
[build_scorer] => 1421272
)
[children] => Array
(
[0] => Array
(
[type] => BooleanQuery
[description] => +text:justin +text:trump +text:clinton +text:harry
[time_in_nanos] => 128547273
[breakdown] => Array
(
[set_min_competitive_score_count] => 0
[match_count] => 0
[shallow_advance_count] => 0
[set_min_competitive_score] => 0
[next_doc] => 126071813
[match] => 0
[next_doc_count] => 439522
[score_count] => 0
[compute_max_score_count] => 0
[compute_max_score] => 0
[advance] => 260695
[advance_count] => 1
[score] => 0
[build_scorer_count] => 2
[create_weight] => 373620
[shallow_advance] => 0
[create_weight_count] => 1
[build_scorer] => 1401619
)
[children] => Array
(
[0] => Array
(
[type] => TermQuery
[description] => text:justin
[time_in_nanos] => 40691947
)
[1] => Array
(
[type] => TermQuery
[description] => text:trump
[time_in_nanos] => 42972729
)
[2] => Array
(
[type] => TermQuery
[description] => text:clinton
[time_in_nanos] => 29407195
)
[3] => Array
(
[type] => TermQuery
[description] => text:harry
[time_in_nanos] => 33799904
)
)
)
)
)
)
[rewrite_time] => 260704
[collector] => Array
(
[0] => Array
(
[name] => SimpleTopScoreDocCollector
[reason] => search_top_hits
[time_in_nanos] => 116380511
)
)
)
)
)
)
)
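The goal is to get all matching documents at once. I only need the document IDs (to check whether the given terms occur in a document), so I set index_options to docs. I am aware of the scroll API, but I want to use max_result_window instead. I am using a single shard with no replicas, and I also avoid scoring the documents when searching.
My questions are as follows: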
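I only want to retrieve document IDs and skip the document fetch phase, so I disabled the _source field. To avoid the other metadata, I followed this link and tried "stored_fields": "_none_", "docvalue_fields": ["_id"] to skip the fetch phase. But I can still see the document type and index name in each hit. Is there anything else I need to do to get only the document IDs and avoid the fetch phase?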
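To make the request concrete, here is a minimal sketch of the search parameters with those two options added, written in the PHP client's array syntax. The helper name buildIdOnlySearchParams and the size value of 10000 are placeholders for illustration only, and the index name is taken from the profile output above. The sketch just builds and prints the request body; it does not contact the cluster and makes no claim about whether the fetch phase is actually skipped.

```php
<?php
// Hypothetical helper (not part of the Elasticsearch PHP client):
// builds search parameters that ask for no stored fields and only
// the _id doc value, as described in the question.
function buildIdOnlySearchParams(string $index, int $size): array
{
    return [
        'index' => $index,
        'body'  => [
            'from'            => 0,
            'size'            => $size,
            'stored_fields'   => '_none_',   // request no stored fields
            'docvalue_fields' => ['_id'],    // the option tried in the question
            'query' => [
                'bool' => [
                    'filter' => [
                        'match' => [
                            'text' => [
                                'query'    => 'justin trump clinton harry',
                                'operator' => 'and',
                            ],
                        ],
                    ],
                ],
            ],
        ],
    ];
}

// 10000 is an illustrative size; it must not exceed index.max_result_window.
$params = buildIdOnlySearchParams('elasticindex-2', 10000);

// Print the JSON body that would be sent to the search endpoint.
echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;
```

With a real client instance, the array would be passed to its search() method.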
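Since I am retrieving all matching documents, scoring is irrelevant to me, so I used a filter clause. Why, then, do I see time reported for a BoostQuery in the Profile API results above, including a non-zero score time? Note that the score time of the BooleanQuery itself is zero!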
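The numbers behind this question can be put side by side with a short sketch. All values are copied from the profile output above. The one assumption, which I believe matches the Profile API documentation, is that a parent's time_in_nanos is inclusive of its children, so subtracting the BooleanQuery time from the BoostQuery time gives a rough figure for the wrapper alone.

```php
<?php
// Values copied from the profile output above (all in nanoseconds).
$boostQuery = [
    'time_in_nanos' => 176108294,
    'score'         => 14477781,
    'score_count'   => 439522,
];
$boolQuery = [
    'time_in_nanos' => 128547273,
    'score'         => 0,
    'score_count'   => 0,
];

// Assumption: the parent's time includes its child's time, so the
// difference approximates time spent in the BoostQuery wrapper itself.
$boostSelfNanos = $boostQuery['time_in_nanos'] - $boolQuery['time_in_nanos'];

// Share of the BoostQuery total that is attributed to score() calls.
$scoreShare = $boostQuery['score'] / $boostQuery['time_in_nanos'];

printf("BoostQuery total:      %.2f ms\n", $boostQuery['time_in_nanos'] / 1e6);
printf("BooleanQuery total:    %.2f ms\n", $boolQuery['time_in_nanos'] / 1e6);
printf("BoostQuery minus child: %.2f ms\n", $boostSelfNanos / 1e6);
printf("score() share of BoostQuery: %.1f%%\n", $scoreShare * 100);
```

The BoostQuery reports as many score calls as next_doc calls (439522 each), while the BooleanQuery reports none, which is exactly what puzzles me.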
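To find out how much time the Boolean query spent searching the Lucene index alone, should I take only the time reported for the BooleanQuery, or do I need to add up the times of all its children (the TermQuery entries)? When I add up all of the TermQuery times, the sum is higher than the time reported for the BooleanQuery. What could be the reason for this?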
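The discrepancy can be reproduced with a few lines of arithmetic. This is only a sketch that uses the time_in_nanos values copied from the profile output above; the variable names are mine.

```php
<?php
// time_in_nanos reported for the BooleanQuery in the profile output above.
$boolQueryNanos = 128547273;

// time_in_nanos reported for each child TermQuery.
$termQueryNanos = [
    'text:justin'  => 40691947,
    'text:trump'   => 42972729,
    'text:clinton' => 29407195,
    'text:harry'   => 33799904,
];

// Sum of the children and how far it exceeds the parent's reported time.
$childrenSum = array_sum($termQueryNanos);
$difference  = $childrenSum - $boolQueryNanos;

printf("Sum of TermQuery times: %d ns\n", $childrenSum);
printf("BooleanQuery time:      %d ns\n", $boolQueryNanos);
printf("Children exceed parent by %d ns (%.2f ms)\n", $difference, $difference / 1e6);
```

So the four children together report roughly 18 ms more than their parent does.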
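Do I also need to include the collector time when timing my Boolean query? The Profile API documentation says: "Lucene works by defining a "Collector" which is responsible for coordinating the traversal, scoring, and collection of matching documents." It also says: "It should be noted that Collector times are independent from the Query times. They are calculated, combined, and normalized independently! Due to the nature of Lucene's execution, it is impossible to "merge" the times from the Collectors into the Query section, so they are displayed in separate portions." As I understand it, the collector helps traverse the postings lists of the Lucene index in order to execute the Boolean query. Am I right about this?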
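Is there a similar API for investigating indexing time in Elasticsearch? I was able to get the indexing time from the settings API, but I am looking for something comparable to the Profile API.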