sql - PostgreSQL - Find the oldest record with a specific value

Question

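I have a document management system that records every historical event in a history table. I have been asked to provide the oldest doc_id that had a status of 5 for a given client on a given date. The table looks like this (truncated for simplicity):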

doc_history:
    id integer
    doc_id integer
    event_date timestamp
    client_id integer
    status_id integer

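The client_id and status_id columns hold the document's values after the event took place. This means the row with the highest history id for a given doc_id matches the same columns in the documents table. By limiting the events to a specific event date, you can see the document's values as of that time. Because these values are not static, I cannot simply search for a specific client_id with a status_id of 5, since the row found might not be the max(id) for that document. Hopefully that makes some sense.

What I have found that works, but is slow, is the following: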

select
    t.*
from
    (select
        distinct on (doc_id)
        *
    from
        doc_history
    where
        event_date <= '2013-02-17 23:59:59'
    order by
        doc_id, id desc) t
where
    t.client_id = 9999 and
    t.status_id = 5
limit 1;

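Basically, I get the highest id for each document ID on or before the given maximum event date, then verify that this latest history row is assigned to the given client with the status set to 5.

The downside of doing it my way is that I scan all history records for all clients to get their maximums, and only then pick out the one client and status I am looking for. As of now this scans roughly 15.06 million rows and takes about 90 seconds on my development server (which is not very fast).

To complicate matters further, I need to do this for each day of the previous week, or seven times in total per run. In addition, every document in the system starts with status 5, meaning a new document. Because of that, the following query would simply return the first document ever entered for that client: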

select * from doc_history where client_id = 9999 and
    status_id = 5 and
    event_date <= '2013-02-17 23:59:59'
    order by id limit 1;

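What I am hoping to do is scan until I find the latest history record of a document that matches the specific client and status values, without first having to find the max id for every doc id of every client. I do not know whether this can be done with window functions (partitioning) or some other logic I am not seeing at the moment.

An example of the events for one document in the doc_history table: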

# select id, doc_id, event, old_value, new_value, event_date, client_id, status_id from doc_history where doc_id = 9999999 order by id;
    id    | doc_id  | event | old_value | new_value |         event_date         | client_id | status_id
----------+---------+-------+-----------+-----------+----------------------------+-----------+-----------
 25362415 | 9999999 |    13 |           |           | 2013-02-14 11:49:50.032824 |      9999 |         5
 25428192 | 9999999 |    15 |           |           | 2013-02-18 11:15:48.272542 |      9999 |         5
 25428193 | 9999999 |     7 | 5         | 1         | 2013-02-18 11:15:48.301377 |      9999 |         1

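Event 7 is a status change; the old and new values show a change from 5 to 1, which is reflected in the status_id column. For any event_date less than or equal to 2013-02-17 23:59:59, the document above would be the oldest "NEW" document with a status_id of 5, but after 2013-02-17 it no longer is.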

score 3 · Accepted Answer

This should be faster:

SELECT *
FROM   doc_history h1
WHERE  event_date < '2013-02-18 0:0'::timestamp
AND    client_id = 9999
AND    status_id = 5
AND NOT EXISTS (
   SELECT 1
   FROM   doc_history h2
   WHERE  h2.doc_id = h1.doc_id
   AND    h2.event_date < '2013-02-18 0:0'::timestamp
   AND    h2.event_date > h1.event_date  -- use event_date instead of id!
   )
ORDER  BY doc_id
LIMIT  1;

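I have a hard time making sense of your description. Basically, as I understand it now, you want the row with the smallest doc_id for a given (client_id, status_id) with an event_date before a given timestamp, where no other row with a higher id (which equates to a later event_date) exists for the same doc_id before that timestamp.

Note how I replaced the condition from your example: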

WHERE  event_date <= '2013-02-17 23:59:59'

with:

WHERE  event_date < '2013-02-18 0:0'

Since you have fractional seconds, your expression would fail for timestamps such as:
'2013-02-17 23:59:59.123'
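
To make the difference concrete, here is a minimal check that compares that timestamp against both bounds. The column aliases closed_bound and half_open_bound are names made up for this illustration only.

```sql
-- Compare a late-evening timestamp with fractional seconds against both bounds.
SELECT timestamp '2013-02-17 23:59:59.123' <= timestamp '2013-02-17 23:59:59' AS closed_bound     -- false: the row is missed
     , timestamp '2013-02-17 23:59:59.123' <  timestamp '2013-02-18 00:00'    AS half_open_bound; -- true: the row is included
```

The exclusive upper bound on the next day covers the whole of 2013-02-17 regardless of how many fractional digits the column stores.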

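I use h2.event_date > h1.event_date rather than h2.id > h1.id in the NOT EXISTS anti-semi-join because I think it is unwise to assume that a greater id equals a later event_date. You should probably rely on event_date alone.

To make this fast, you need a multicolumn index of the form (updated):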

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id, status_id, doc_id, event_date DESC);

I switched the positions of doc_id and event_date DESC after your feedback; this should better suit ORDER BY doc_id LIMIT 1.
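
One way to check whether the planner actually picks up the index is to run the query under EXPLAIN. This is only a sketch: the plan, timings and buffer counts depend on your data, statistics and PostgreSQL version, so no output is shown here.

```sql
-- Sketch: show the actual plan and buffer usage for the query above.
-- Note that EXPLAIN ANALYZE really executes the statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   doc_history h1
WHERE  event_date < '2013-02-18 0:0'::timestamp
AND    client_id = 9999
AND    status_id = 5
AND NOT EXISTS (
   SELECT 1
   FROM   doc_history h2
   WHERE  h2.doc_id = h1.doc_id
   AND    h2.event_date < '2013-02-18 0:0'::timestamp
   AND    h2.event_date > h1.event_date
   )
ORDER  BY doc_id
LIMIT  1;
```

Ideally the scan on h1 uses doc_history_multi_idx instead of a sequential scan over the whole table.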

If the condition status_id = 5 is constant (you always check for 5), a partial index should be faster, though:

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id, doc_id, event_date DESC)
WHERE status_id = 5;

And:

CREATE INDEX doc_history_id_idx ON doc_history (doc_id, event_date DESC);
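
Since the question mentions running this for each day of the previous week, the same query could be wrapped in a LATERAL join over a generated list of days. This is a sketch under assumptions, not a tested solution for your data: it requires PostgreSQL 9.3 or later for LATERAL, the date range 2013-02-11 to 2013-02-17 is just an example week, and the aliases d, x and as_of_day are invented for the illustration.

```sql
-- Sketch: one result row per day of an assumed example week.
-- Requires LATERAL (PostgreSQL 9.3 or later).
SELECT d.day::date AS as_of_day, x.*
FROM   generate_series(timestamp '2013-02-11'
                     , timestamp '2013-02-17'
                     , interval '1 day') AS d(day)
LEFT   JOIN LATERAL (
   SELECT h1.*
   FROM   doc_history h1
   WHERE  h1.event_date < d.day + interval '1 day'  -- exclusive upper bound: start of next day
   AND    h1.client_id = 9999
   AND    h1.status_id = 5
   AND    NOT EXISTS (
      SELECT 1
      FROM   doc_history h2
      WHERE  h2.doc_id = h1.doc_id
      AND    h2.event_date < d.day + interval '1 day'
      AND    h2.event_date > h1.event_date
      )
   ORDER  BY h1.doc_id
   LIMIT  1
   ) x ON true
ORDER  BY d.day;
```

The LEFT JOIN ... ON true keeps a row for every day, with NULLs in the doc_history columns on days where no document qualifies.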

score 1 · Accepted Answer

provide the oldest doc_id that had a status of 5 for a given client on a given date

This will do it:

select
    min(doc_id) doc_id
from
    doc_history
where
    client_id = 9999
    and status_id = 5
    and event_date::date = '2013-02-17'

I read your question more than once and could not understand what you are saying.

score 0 · Accepted Answer

If I got it right, your equivalent and probably faster query would be:

select t.*
from doc_history t
where event_date <= '2013-02-17 23:59:59' and
    t.client_id = 9999 and
    t.status_id = 5
order by doc_id, id desc
limit 1;
