(Fair warning: apologies in advance for the hackery that follows...)
Background:
I have a legacy application whose large body of SQL I want to avoid rewriting. I'm trying to speed up one particular kind of very expensive query that it performs (i.e. the low-hanging fruit).
It has a ledger of financial transactions represented by a transactions table. When a new row is inserted, a trigger function (not shown here) carries forward the new balance for the given entity.

Certain transaction types model externalities (such as a payment in progress) by tagging the new transaction with a "related" transaction, so that the application can group related transactions together.
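The real trigger isn't shown here; purely to illustrate the carry-forward idea, a minimal sketch might look like the following (the function and trigger names are made up, and the real logic is more involved):

CREATE OR REPLACE FUNCTION carry_forward_balance() RETURNS trigger AS $$
BEGIN
    -- Start from the entity's most recent balance (0 if none) and apply the new amount.
    NEW.balance := COALESCE(
        (SELECT balance
           FROM transactions
          WHERE entityid = NEW.entityid
          ORDER BY transactionid DESC
          LIMIT 1),
        0) + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER carry_forward_balance_trg
    BEFORE INSERT ON transactions
    FOR EACH ROW EXECUTE PROCEDURE carry_forward_balance();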
\d transactions
                   Table "public.transactions"
    Column     |   Type    |              Modifiers
---------------+-----------+--------------------------------------
 entityid      | bigint    | not null
 transactionid | bigint    | not null default nextval('tid_seq')
 type          | smallint  | not null
 status        | smallint  | not null
 related       | bigint    |
 amount        | bigint    | not null
 abs_amount    | bigint    | not null
 is_credit     | boolean   | not null
 inserted      | timestamp | not null default now()
 description   | text      | not null
 balance       | bigint    | not null
Indexes:
    "transactions_pkey" PRIMARY KEY, btree (transactionid)
    "transactions by entityid" btree (entityid)
    "transactions by initial trans" btree ((COALESCE(related, transactionid)))
Foreign-key constraints:
    "invalid related transaction!" FOREIGN KEY (related)
        REFERENCES transactions(transactionid)
In my test dataset I have:
- roughly 5.5 million rows in total
- roughly 3.7 million rows with no "related" transaction
- roughly 1.8 million rows with a "related" transaction
- roughly 55k distinct entityids (customers)
So roughly 1/3 of the transaction rows are updates "related" to some earlier transaction. The production data is roughly 25x larger in transactionid terms, roughly 8x larger in distinct entityids, and has the same 1/3 ratio of transaction updates.
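For reference, figures like the above can be reproduced with a simple aggregate along these lines (a sketch; it just counts rows with and without a related value):

SELECT count(*)                  AS total_rows,
       count(*) - count(related) AS rows_without_related,
       count(related)            AS rows_with_related,
       count(DISTINCT entityid)  AS distinct_entityids
FROM transactions;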
The code queries a particularly inefficient VIEW defined as:
CREATE VIEW collapsed_transactions AS
SELECT t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM ( SELECT
COALESCE(x.related, x.transactionid) AS initial,
max(x.transactionid) AS latest
FROM transactions x
GROUP BY COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;
A typical query takes the form:
SELECT * FROM collapsed_transactions WHERE entityid = 204425;
As you can see, the WHERE entityid = 204425 clause isn't used to constrain the GROUP BY subquery, so transactions for all entities get grouped, yielding a subquery result set spanning all ~55,000 entities and a far longer query time... all to end up with an average of about 40 rows (71 in this example) at the time of writing.
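In other words, what I'd like the planner to arrive at is the hand-written equivalent below, with the predicate pushed into the grouping subquery (shown only for illustration; this is the rewrite I'm trying not to have to make everywhere):

SELECT t.entityid,
       g.initial,
       g.latest,
       i.inserted AS created,
       t.inserted AS updated,
       t.type,
       t.status,
       t.amount,
       t.abs_amount,
       t.is_credit,
       t.balance,
       t.description
FROM ( SELECT
         COALESCE(x.related, x.transactionid) AS initial,
         max(x.transactionid) AS latest
       FROM transactions x
       WHERE x.entityid = 204425   -- the predicate pushed down by hand
       GROUP BY COALESCE(x.related, x.transactionid)
     ) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;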
I can't normalize the transactions table any further (say, splitting it into initial_transactions and updated_transactions tables joined on related) without rewriting hundreds of SQL queries in the codebase, many of which rely on the self-join semantics in different ways.
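For context, the split I'm ruling out would look something like this (a sketch only; the exact column layout is illustrative, and doing this is precisely the rewrite I want to avoid):

CREATE TABLE initial_transactions (
    transactionid bigint    PRIMARY KEY,
    entityid      bigint    NOT NULL,
    type          smallint  NOT NULL,
    status        smallint  NOT NULL,
    amount        bigint    NOT NULL,
    abs_amount    bigint    NOT NULL,
    is_credit     boolean   NOT NULL,
    inserted      timestamp NOT NULL DEFAULT now(),
    description   text      NOT NULL,
    balance       bigint    NOT NULL
);

CREATE TABLE updated_transactions (
    transactionid bigint    PRIMARY KEY,
    related       bigint    NOT NULL REFERENCES initial_transactions (transactionid),
    -- ...same payload columns as initial_transactions...
    inserted      timestamp NOT NULL DEFAULT now(),
    balance       bigint    NOT NULL
);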
The insight:
I originally tried rewriting the query using WINDOW functions, but ran into assorted problems (another SO question for another time). Then, when I saw that www_fdw passes its WHERE clause along as GET/POST parameters over HTTP, I got intrigued by the possibility of optimizing a very naive query without much restructuring.
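For what it's worth, the window-function rewrite I was playing with looked roughly like the following (a sketch, not the exact query that gave me trouble; note it doesn't produce the created column, which would need another join back to the initial row):

SELECT s.entityid,
       s.initial,
       s.latest,
       s.inserted AS updated,
       s.type,
       s.status,
       s.amount,
       s.abs_amount,
       s.is_credit,
       s.balance,
       s.description
FROM ( SELECT t.*,
              COALESCE(t.related, t.transactionid) AS initial,
              max(t.transactionid) OVER (
                  PARTITION BY COALESCE(t.related, t.transactionid)) AS latest
       FROM transactions t
       WHERE t.entityid = 204425
     ) s
WHERE s.transactionid = s.latest;   -- keep only the newest row of each group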
F.31.4. Remote Query Optimization

postgres_fdw attempts to optimize remote queries to reduce the amount of data transferred from foreign servers. This is done by sending query WHERE clauses to the remote server for execution, and by not retrieving table columns that are not needed for the current query. To reduce the risk of misexecution of queries, WHERE clauses are not sent to the remote server unless they use only built-in data types, operators, and functions. Operators and functions in the clauses must be IMMUTABLE as well.

The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.
The attempt:
So I thought that maybe I could put the GROUP BY into a view, treat that view as a foreign table, and the optimizer would then pass the WHERE clause through to that foreign table, yielding a much more efficient query...
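(For completeness: the loopback postgres_fdw setup assumed below looks roughly like this; the database name, user and password are placeholders.)

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER local_pg_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5432', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER local_pg_server
    OPTIONS (user 'myuser', password 'mypassword');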
CREATE VIEW foreign_transactions_grouped_by_initial_transaction AS
SELECT
entityid,
COALESCE(t.related, t.transactionid) AS initial,
MAX(t.transactionid) AS latest
FROM transactions t
GROUP BY
t.entityid,
COALESCE(t.related, t.transactionid);
CREATE FOREIGN TABLE transactions_grouped_by_initial_transaction
(entityid bigint, initial bigint, latest bigint)
SERVER local_pg_server
OPTIONS (table_name 'foreign_transactions_grouped_by_initial_transaction');
EXPLAIN ANALYSE VERBOSE
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial
WHERE g.entityid = 204425;
And it works great!
Nested Loop (cost=100.87..305.05 rows=10 width=116)
(actual time=4.113..16.646 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted,
t.inserted, t.type, t.status, t.amount, t.abs_amount,
t.balance, t.description
-> Nested Loop (cost=100.43..220.42 rows=10 width=108)
(actual time=4.017..10.725 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted,
t.type, t.status, t.amount, t.abs_amount, t.is_credit,
t.balance, t.description
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..135.80 rows=10 width=16)
(actual time=3.914..4.694 rows=71 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
WHERE ((entityid = 204425))
-> Index Scan using transactions_pkey on public.transactions t
(cost=0.43..8.45 rows=1 width=100)
(actual time=0.023..0.035 rows=1 loops=71)
Output: t.entityid, t.transactionid, t.type, t.status,
t.related, t.amount, t.abs_amount, t.is_credit,
t.inserted, t.description, t.balance
Index Cond: (t.transactionid = g.latest)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.45 rows=1 width=16)
(actual time=0.021..0.033 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status,
i.related, i.amount, i.abs_amount, i.is_credit,
i.inserted, i.description, i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 20.363 ms
The problem:
However, when I try to bake that into a VIEW (with or without another layer of postgres_fdw), the query optimizer no longer appears to pass the WHERE clause through :-(
CREATE view collapsed_transactions_fast AS
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial;
EXPLAIN ANALYSE VERBOSE
SELECT * FROM collapsed_transactions_fast WHERE entityid = 204425;
Which results in:
Nested Loop (cost=534.97..621.88 rows=1 width=117)
(actual time=104720.383..139307.940 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
-> Hash Join (cost=534.53..613.66 rows=1 width=109)
(actual time=104720.308..139305.522 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
Hash Cond: (g.latest = t.transactionid)
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..171.44 rows=2048 width=16)
(actual time=23288.569..108916.051 rows=3705600 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
-> Hash (cost=432.76..432.76 rows=142 width=101)
(actual time=2.103..2.103 rows=106 loops=1)
Output:
t.entityid, t.inserted, t.type, t.status, t.amount,
t.abs_amount, t.is_credit, t.balance, t.description,
t.transactionid
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Index Scan using "transactions by entityid"
on public.transactions t
(cost=0.43..432.76 rows=142 width=101)
(actual time=0.049..1.241 rows=106 loops=1)
Output: t.entityid, t.inserted, t.type, t.status,
t.amount, t.abs_amount, t.is_credit,
t.balance, t.description, t.transactionid
Index Cond: (t.entityid = 204425)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.20 rows=1 width=16)
(actual time=0.013..0.018 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status, i.related,
i.amount, i.abs_amount, i.is_credit, i.inserted, i.description,
i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 139575.140 ms
If I could bake that behaviour into a VIEW or an FDW, I could simply substitute that VIEW's name into the handful of queries that need it and get the efficiency gain. I don't care if it's terribly slow for other use cases (more complex WHERE clauses); I'll name the VIEW to reflect its intended use.
use_remote_estimate is at its default of FALSE, but it makes no difference either way.
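(Flipping that option on looks like this, for reference; I mention it only because it's the obvious knob, and it didn't change anything.)

ALTER SERVER local_pg_server
    OPTIONS (ADD use_remote_estimate 'true');

ALTER FOREIGN TABLE transactions_grouped_by_initial_transaction
    OPTIONS (ADD use_remote_estimate 'true');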
The question:
Is there some trick I can use to make this admittedly hacky approach work?