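I mostly use the R package dbplyr to interact with a PostgreSQL database. It works by "piping" operations together, which are then translated to SQL and executed as a single query. This tends to produce many nested joins. I wonder how smart the planner is at resolving such fairly verbose and unoptimized expressions. Is it even possible to write "bad queries" as long as they use only, for example, SELECT, WHERE and JOIN (no functions, casts, etc.) and the end result is the same? What would such a query look like? For example, does the planner work out which columns are needed in a hash join in order to reduce memory, even if the columns are not specified in that join but only after a later join involving 6 more tables?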
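For example, can I safely ignore:
- join order
- when columns are selected
- when filters are applied
I have found plenty of information on how the planner estimates costs and chooses paths, but not much on how it arrives at a "minimal form" of the query in the first place. EXPLAIN ANALYZE does not help, because it does not show which columns were ultimately selected. I am sure someone will be unhappy with this question for being too vague. If so, please point me in the right direction :)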
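Edit:
An example.
This is what a typical query looks like in R using dbplyr. "gene_annotations" has the columns "gene" and "annotation_term". "genemaps" has "genemap", "gene", "probe" and "study". Here I want to get the genes and annotations associated with a probe.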
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps"), by = "gene") %>%
filter(probe == 1L) %>% select(gene, annotation_term)
This translates to:
SELECT "gene", "annotation_term"
FROM (SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term", "RHS"."genemap" AS "genemap", "RHS"."probe" AS "probe", "RHS"."study" AS "study"
FROM "gene_annotations" AS "LHS"
INNER JOIN "genemaps" AS "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "dbplyr_004"
WHERE ("probe" = 1)
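Can I trust that this has exactly the same performance (apart from the time to parse and analyze the expression) as, for example, the following expression?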
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps") %>%
filter(probe == 1L) %>% select(gene) , by = "gene")
This translates to:
SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term"
FROM "gene_annotations" AS "LHS"
INNER JOIN (SELECT "gene"
FROM "genemaps"
WHERE ("probe" = 1)) "RHS"
ON ("LHS"."gene" = "RHS"."gene")
The plan is the same in both cases:
Nested Loop (cost=0.86..72.09 rows=546 width=8)
-> Index Only Scan using genemaps_probe_index on genemaps (cost=0.43..2.16 rows=36 width=4)
Index Cond: (probe = 1)
-> Index Only Scan using gene_annotations_pkey on gene_annotations "LHS" (cost=0.43..1.79 rows=15 width=8)
Index Cond: (gene = genemaps.gene)
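I did not want to provide an example, because I have no problem with this particular query. I want to know whether I can always ignore these issues entirely and just stitch joins together until I get the final result I want.
Edit 2:
I found that EXPLAIN has a VERBOSE option that shows which columns are returned. For the small example above, the plans are identical in this respect too. But can I assume the same holds for all reasonably complex queries? Here is an example of what my queries typically look like. As you can see, the SQL generated by dbplyr is not very easy to read. Here it joins six tables after various SELECT/WHERE steps.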
SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "LHS"."probe_name" AS "probe_name", "RHS"."factor_order" AS "factor_order"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "RHS"."probe_name" AS "probe_name"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "RHS"."gene_symbol" AS "gene_symbol"
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name", "value", "gene", "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."genemap" AS "genemap", "LHS"."annotation_term" AS "annotation_term", "LHS"."value" AS "value", "RHS"."gene" AS "gene", "RHS"."probe" AS "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "RHS"."genemap" AS "genemap", "RHS"."annotation_term" AS "annotation_term", "RHS"."value" AS "value"
FROM (SELECT *
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name"
FROM "sample_view") "dbplyr_031"
WHERE (270 = 270)) "LHS"
INNER JOIN "gene_measurements" AS "RHS"
ON ("LHS"."sample" = "RHS"."sample")
) "LHS"
INNER JOIN (SELECT "genemap", "gene", "probe"
FROM "genemaps"
WHERE ("gene" IN (54812) AND "study" = 270)) "RHS"
ON ("LHS"."genemap" = "RHS"."genemap")
) "dbplyr_032") "LHS"
INNER JOIN (SELECT "gene", "gene_symbol"
FROM "genes") "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "LHS"
INNER JOIN (SELECT "probe", "probe_name"
FROM "probes") "RHS"
ON ("LHS"."probe" = "RHS"."probe")
) "LHS"
INNER JOIN (SELECT "group", "annotation_term_value" AS "factor_order"
FROM (SELECT "LHS"."group" AS "group", "LHS"."annotation_term" AS "annotation_term", "RHS"."annotation_term_value" AS "annotation_term_value"
FROM "group_annotations" AS "LHS"
INNER JOIN (SELECT "annotation_term", "annotation_term_value"
FROM "annotation_terms"
WHERE ("annotation_type" = 111)) "RHS"
ON ("LHS"."annotation_term" = "RHS"."annotation_term")
) "dbplyr_033") "RHS"
ON ("LHS"."sample_group" = "RHS"."group")
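For reference, this is roughly how the column comparison mentioned in Edit 2 can be run for the two small queries from the first edit. It is only a sketch: the `COSTS OFF` option is an addition here, included on the assumption that dropping the cost figures makes the two outputs easier to compare side by side. With VERBOSE, each plan node prints an `Output:` line listing the columns it emits.

```sql
-- Form 1: filter and final column selection applied after the join,
-- exactly as dbplyr generated it.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT "gene", "annotation_term"
FROM (SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term",
             "RHS"."genemap" AS "genemap", "RHS"."probe" AS "probe", "RHS"."study" AS "study"
      FROM "gene_annotations" AS "LHS"
      INNER JOIN "genemaps" AS "RHS"
      ON ("LHS"."gene" = "RHS"."gene")
) "dbplyr_004"
WHERE ("probe" = 1);

-- Form 2: filter and column selection pushed into the right-hand side by hand.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term"
FROM "gene_annotations" AS "LHS"
INNER JOIN (SELECT "gene"
            FROM "genemaps"
            WHERE ("probe" = 1)) "RHS"
ON ("LHS"."gene" = "RHS"."gene");
```

If the `Output:` lines of corresponding nodes match in both plans, the unused columns ("genemap", "probe", "study") were dropped before the join in both forms. The same check can be applied to the six-table query above by prefixing it with `EXPLAIN (VERBOSE)`.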