performance - Slow query times in PostgreSQL when UTC milliseconds are stored as bigint

Question

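Because of cost, we are migrating from a time-series database (the ECHO historian) to an open-source database. We chose PostgreSQL because, as far as we know, there are no open-source time-series databases. All we used to store in ECHO was time and value pairs.

Now here is the problem. The table I created in Postgres has 2 columns. The first is of type bigint and stores the time in UTC milliseconds (a 13-digit number); the second is the value, with data type real. I have loaded about 3.6 million rows (spread over a 30-day time range), and when I query a small time range (say 1 day), the query takes 4 seconds, whereas for the same range in ECHO the response time is 150 ms! That is a huge difference. Having a bigint for the time seems to be the reason for the slowness, but I am not sure. Could you suggest how to improve the query time? I have also read about the data types timestamp and timestamptz, and it looks like we would need to store the date and time in a regular format rather than as UTC seconds. Would that help speed up my queries?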

Here is my table definition:

            Table "public. MFC2 Flow_LCL "
  Column  |  Type  | Modifiers | Storage | Stats target | Description
----------+--------+-----------+---------+--------------+-------------
 the_time | bigint |           | plain   |              |
 value    | real   |           | plain   |              |

Indexes:
"MFC2 Flow_LCL _time_idx" btree (the_time)

Has OIDs: no

Currently I store the time in UTC milliseconds (using bigint). The challenge here is that there may be duplicate time-value pairs.
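If the column were changed to timestamp or timestamptz, the client would have to convert its millisecond values into a date/time literal. A minimal sketch of that conversion, assuming non-negative input; the helper name utc_millis_to_iso is hypothetical and not part of any library:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: format UTC milliseconds since the Unix epoch as
// "YYYY-MM-DD HH:MM:SS.mmm+00", a form usable as a timestamptz literal.
std::string utc_millis_to_iso(std::int64_t ms) {
    assert(ms >= 0);  // assumption: only dates from 1970 onwards
    std::time_t secs = static_cast<std::time_t>(ms / 1000);
    int millis = static_cast<int>(ms % 1000);
    // std::gmtime returns a pointer to static storage, so copy the result.
    std::tm tm_utc = *std::gmtime(&secs);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%04d-%02d-%02d %02d:%02d:%02d.%03d+00",
                  tm_utc.tm_year + 1900, tm_utc.tm_mon + 1, tm_utc.tm_mday,
                  tm_utc.tm_hour, tm_utc.tm_min, tm_utc.tm_sec, millis);
    return buf;
}
```

On the server side, PostgreSQL's to_timestamp(double precision) performs the equivalent conversion, for example to_timestamp(the_time / 1000.0).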

Here is the query I am using (through a simple API call that passes in the table name and the start and end times):

PGresult *res;

int rec_count;
std::string sSQL;

sSQL.append("SELECT * FROM ");
sSQL.append(" \" ");
sSQL.append(table);
sSQL.append(" \" ");
sSQL.append(" WHERE");
sSQL.append(" time >= ");
CString sTime;
sTime.Format("%I64d",startTime);
sSQL.append(sTime);
sSQL.append(" AND time <= ");
CString eTime;
eTime.Format("%I64d",endTime);
sSQL.append(eTime);
sSQL.append(" ORDER BY time ");

res = PQexec(conn, sSQL.c_str());
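Two details in the snippet above may be worth checking: the quotes around the table name are padded with spaces, which become part of the quoted identifier, and the filter uses a column called time while the table definition shows the_time. A minimal sketch of building the same statement with plain std::string, assuming the column is the_time; quote_ident and build_range_query are hypothetical helpers, not libpq functions:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: wrap an identifier in double quotes and double any
// embedded double quote, following SQL's quoted-identifier rules.
std::string quote_ident(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') out += '"';
        out += c;
    }
    out += '"';
    return out;
}

// Build the range query. The column name the_time is an assumption taken
// from the table definition; change it if the real column differs.
std::string build_range_query(const std::string& table,
                              std::int64_t start_ms, std::int64_t end_ms) {
    assert(start_ms <= end_ms);  // precondition: a non-empty range
    return "SELECT the_time, value FROM " + quote_ident(table) +
           " WHERE the_time >= " + std::to_string(start_ms) +
           " AND the_time <= " + std::to_string(end_ms) +
           " ORDER BY the_time";
}
```

The resulting string can be passed to PQexec as before. libpq also offers PQescapeIdentifier for quoting and PQexecParams for sending the two times as parameters, which may be preferable to hand-built strings.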

score 0 · Accepted Answer

Are you really already planning for the year 2038 problem? Why not just use an int for the time, as in standard UNIX?

score 0 · Accepted Answer

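Your time-series database, if it works like a competitor's product that I once examined, automatically stores data in "time" column order in a heap-like structure. Postgres does not. As a result, you are doing an O(n) search [n = number of rows in the table]: the whole table must be read to find the rows that match your time filter. A primary key on the timestamp (which creates a unique index) or, if the timestamps are not unique, a regular index will give you a binary O(log n) search for single records and improve performance for all queries that retrieve less than about 5% of the table. Postgres will estimate the crossover point at which an index scan or a full table scan is better.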

You may also want to CLUSTER (PG Docs) the table on that index.
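As a rough illustration of the O(n) versus O(log n) point, here is a sketch that counts the rows in a time range twice: once by checking every element, and once with two binary searches over sorted data. It is only an analogy, with a sorted std::vector standing in for an index; it does not reflect how PostgreSQL's B-tree is implemented, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count values with start <= t <= end by visiting every element: O(n).
std::size_t count_range_scan(const std::vector<std::int64_t>& times,
                             std::int64_t start, std::int64_t end) {
    return static_cast<std::size_t>(std::count_if(
        times.begin(), times.end(),
        [=](std::int64_t t) { return t >= start && t <= end; }));
}

// Same count on an already sorted vector using two binary searches:
// O(log n). Duplicate time values are handled correctly.
std::size_t count_range_sorted(const std::vector<std::int64_t>& sorted_times,
                               std::int64_t start, std::int64_t end) {
    assert(start <= end);  // precondition: a well-formed range
    auto lo = std::lower_bound(sorted_times.begin(), sorted_times.end(), start);
    auto hi = std::upper_bound(sorted_times.begin(), sorted_times.end(), end);
    return static_cast<std::size_t>(hi - lo);
}
```

Both functions return the same count on sorted input; only the amount of data they touch differs.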

Also, follow the advice above not to use time or other SQL reserved words as column names. Even when it is legal, it is asking for trouble.

[This would be better as a comment, but it is too long.]

score 0 · Accepted Answer

SET search_path=tmp;

  -- -------------------------------------------
  -- create table and populate it with 10M rows
  -- -------------------------------------------
DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;

SET search_path=tmp;

CREATE TABLE old_echo
        ( the_time timestamp NOT NULL PRIMARY KEY
        , payload DOUBLE PRECISION NOT NULL
        );

INSERT INTO old_echo (the_time, payload)
SELECT now() - (gs * interval '1 msec')
        , random()
FROM generate_series(1,10000000) gs
        ;

-- DELETE FROM old_echo WHERE random() < 0.8;

VACUUM ANALYZE old_echo;

SELECT MIN(the_time) AS first
        , MAX(the_time) AS last
        , (MAX(the_time) - MIN(the_time))::interval AS width
FROM old_echo
        ;

EXPLAIN ANALYZE
SELECT *
FROM old_echo  oe
JOIN (
        SELECT MIN(the_time) AS first
        , MAX(the_time) AS last
        , (MAX(the_time) - MIN(the_time))::interval AS width
        , ((MAX(the_time) - MIN(the_time))/2)::interval AS half
        FROM old_echo
        ) mima ON 1=1
WHERE oe.the_time >= mima.first + mima.half
AND  oe.the_time < mima.first + mima.half + '1 sec':: interval
        ;

Result:

                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.06..59433.67 rows=1111124 width=64) (actual time=0.101..1.307 rows=1000 loops=1)
   ->  Result  (cost=0.06..0.07 rows=1 width=0) (actual time=0.049..0.050 rows=1 loops=1)
         InitPlan 1 (returns $0)
           ->  Limit  (cost=0.00..0.03 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
                 ->  Index Scan using old_echo_pkey on old_echo  (cost=0.00..284873.62 rows=10000115 width=8) (actual time=0.021..0.021 rows=1 loops=1)
                       Index Cond: (the_time IS NOT NULL)
         InitPlan 2 (returns $1)
           ->  Limit  (cost=0.00..0.03 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
                 ->  Index Scan Backward using old_echo_pkey on old_echo  (cost=0.00..284873.62 rows=10000115 width=8) (actual time=0.009..0.009 rows=1 loops=1)
                       Index Cond: (the_time IS NOT NULL)
   ->  Index Scan using old_echo_pkey on old_echo oe  (cost=0.01..34433.30 rows=1111124 width=16) (actual time=0.042..0.764 rows=1000 loops=1)
         Index Cond: ((the_time >= (($0) + ((($1 - $0) / 2::double precision)))) AND (the_time < ((($0) + ((($1 - $0) / 2::double precision))) + '00:00:01'::interval)))
 Total runtime: 1.504 ms
(13 rows)

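Update: since the timestamps appear to be non-unique (by the way: what do duplicates mean in this case?), I added an extra key column. It is an ugly hack, but it works here. Query time for 10M minus 80% of the rows is 11 ms (number of rows hit: 210/222067):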

CREATE TABLE old_echo
        ( the_time timestamp NOT NULL
        , the_seq SERIAL NOT NULL -- to catch the duplicate keys
        , payload DOUBLE PRECISION NOT NULL
        ,       PRIMARY KEY(the_time, the_seq)
        );

    -- Adding the random will cause some timestamps to be non-unique.
    -- (and others to be non-existent)
INSERT INTO old_echo (the_time, payload)
SELECT now() - ((gs+random()*1000::integer) * interval '1 msec')
        , random()
FROM generate_series(1,10000000) gs
        ;

DELETE FROM old_echo WHERE random() < 0.8;
