sql - PostgreSQL/SQL query optimization

Question

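I have a log table with about 8M records. Because of a programming error, some companies ended up with more than one record for the same date. Now I need to delete from this log every record for a given company and date except the latest one (the one with the largest id). About 300K records need to be deleted.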

The fastest and simplest thing I have tried is this:

delete from indexing_log where id not in (
select max(id)
from indexing_log
group by company_id,
"date"
)

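But this query took an enormous amount of time (about 3 days) on the production server (which for some reason has no SSD drive). I have tried everything I know and need some advice. How can this be made faster?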

UPDATE: I decided to do it in batches via a celery task.
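
For illustration, one batch of such a job might look roughly like the statement below. This is only a sketch, not the code the asker actually ran: the batch size of 10000 is an arbitrary assumption, and the task would rerun the statement (committing after each run) until it reports that 0 rows were deleted.

```sql
-- Delete one batch of duplicates: rows for which a newer row
-- (larger id) exists with the same company_id and "date".
-- The batch size (10000) is an assumption; tune it to the server.
delete from indexing_log
where id in (
    select l.id
    from indexing_log as l
    where exists (
        select 1
        from indexing_log as newer
        where newer.company_id = l.company_id
          and newer."date" = l."date"
          and newer.id > l.id
    )
    limit 10000
);
```

Small batches keep each transaction short, so locks are held briefly and a failed batch only has to redo its own work.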

score 2 · Accepted Answer

Dump the distinct rows into a temporary table:

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

Truncate the original table:

truncate table indexing_log;

Since the table is now empty, take the opportunity to do an instant vacuum:

vacuum full indexing_log;

Move the rows from the temporary table back into the original:

insert into indexing_log
select *
from t;
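
If losing rows on a mid-way failure is a concern, the same steps can be wrapped in a single transaction, since TRUNCATE is transaction-safe in PostgreSQL. The following is a hedged sketch rather than part of the original answer. It assumes no other table has a foreign key referencing indexing_log (TRUNCATE would refuse otherwise) and that blocking all access to the table for the duration is acceptable. VACUUM is left out because it cannot run inside a transaction block.

```sql
begin;

-- Block concurrent writers so no new rows slip in between the copy
-- and the truncate (assumption: a short outage is acceptable).
lock table indexing_log in access exclusive mode;

-- Keep only the row with the largest id per (company_id, "date").
create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

truncate table indexing_log;

insert into indexing_log
select *
from t;

commit;

-- Clean up and refresh planner statistics after the reload.
drop table t;
analyze indexing_log;
```

If anything fails before the commit, the transaction rolls back and the original rows are still in place.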

score 2 · Accepted Answer

You could try:

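delete from indexing_log as l
where
    exists
    (
        -- a newer row (larger id) exists for the same company and date,
        -- so the current row is not the latest and can be deleted
        select *
        from indexing_log as i
        where i.id > l.id and i.company_id = l.company_id and i."date" = l."date"
    );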
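
A correlated subquery like this one probes the table once per row, so it tends to benefit from an index covering the compared columns. Whether such an index already exists on the asker's table is not stated, so the statement below is only an assumption-laden sketch. The index name is hypothetical, and the column is written as "date", as in the question.

```sql
-- Hypothetical index name. CONCURRENTLY avoids blocking writes while
-- the index builds, but it cannot be run inside a transaction block.
create index concurrently indexing_log_company_date_id_idx
    on indexing_log (company_id, "date", id);
```

Building the index on about 8M rows has its own cost, so it is worth comparing the query plans with explain before and after.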

score 1 · Accepted Answer

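Truncating the table should be faster, but with TRUNCATE you can't say "delete everything except ...". If your data allows it, you could write a procedure for this: save the rows with the max IDs into a temporary table, truncate the table, and write the temporary table back. For PostgreSQL the syntax is slightly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html); the first step could be written as:

select *
into temporary table temptable
from indexing_log
where id in (
    select max(id)
    from indexing_log
    group by company_id,
    "date");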

score 1 · Accepted Answer

NOT EXISTS is sometimes faster than NOT IN:

delete from indexing_log as l
where not exists (select 1
                    from (select max(id) as iid
                            from indexing_log
                           group by company_id,
                                 "date") mids
                   where mids.iid = l.id
                 );
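
Whichever variant is used, a quick check like the one below can confirm that no duplicates remain. It is a sketch using only the table and column names from the question. Run before the cleanup, it also shows which company and date pairs are affected.

```sql
-- List every (company_id, "date") pair that still has more than one row.
-- An empty result means the cleanup is complete.
select company_id, "date", count(*) as copies
from indexing_log
group by company_id, "date"
having count(*) > 1;
```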
