
I have a very large database table in PostgreSQL and a column like "copied". Every new row starts out uncopied and will later be replicated to another system by a background program. There is a partial index on that table, "btree(ID) WHERE replicated=0". The background program selects at most 2000 entries (LIMIT 2000), works on them, and then commits the changes in one transaction using 2000 prepared SQL commands.
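
For illustration, a minimal sketch of that setup in SQL (the table and column names here are assumptions, not the real schema):

CREATE TABLE entries (
    id         bigint PRIMARY KEY,
    payload    text,
    replicated smallint NOT NULL DEFAULT 0
);
CREATE INDEX entries_unreplicated ON entries USING btree (id) WHERE replicated = 0;

-- one batch, fetched by the background program per transaction:
SELECT id, payload FROM entries WHERE replicated = 0 ORDER BY id LIMIT 2000;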

Now the problem is that I want to give the user an option to reset this replicated value, i.e. make it all zero again.

An update table set replicated=0;

is not possible:

  • It takes a very long time.
  • It doubles the size of the table because of MVCC.
  • It is done in one transaction: it either fails completely or goes through completely.

I actually don't need transaction semantics for this case: if the system goes down, it's acceptable that only part of the rows have been processed.

Several other problems: Doing an

UPDATE table SET replicated=0 WHERE id > 10000 AND id < 20000

is also bad: it does a sequential scan over the whole table, which is too slow. And even if it didn't, it would still be slow because there would be too many seeks.

What I really need is a way of going through all rows, changing them and not being bound to a giant transaction.

Strangely, an

UPDATE table 
  SET replicated=0 
WHERE ID IN (SELECT id FROM table WHERE replicated != 0 LIMIT 10000)

is also slow, although it should be a good thing: Go through the table in DISK-order...

(Note that in that case there was also an index that covered this)

(An UPDATE with LIMIT, as in MySQL, is not available in PostgreSQL.)

BTW: The real problem is more complicated, and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible. It's PostgreSQL 7.4, unfortunately.

The number of rows I'm talking about is on the order of 90,000,000. The size of the database can be several dozen gigabytes.

The database itself contains only 5 tables, one of which is the very large one. But that is not bad design, because these embedded boxes only operate on one kind of entity; it's not an ERP system or anything like that!

Any ideas?


6 Answers


How about adding a new table to store this replicated value (with a primary key linking each record to the main table)? Then you simply add a record for every replicated item, and delete records to clear the replicated flag. (Or maybe the other way around: a record for every non-replicated record, depending on which is the common case.)

That would also simplify the case where you want to set them all back to 0: you can just truncate the table, which zeroes the table's size on disk; you don't even have to vacuum to free up the space.
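
A minimal sketch of that idea (names are assumptions; big_table stands in for the large main table):

-- one row per replicated record; the flag column disappears from the main table
CREATE TABLE replicated_ids (
    id bigint PRIMARY KEY REFERENCES big_table (id)
);

-- mark a record as replicated:
INSERT INTO replicated_ids (id) VALUES (42);

-- fetch a batch of not-yet-replicated rows:
SELECT t.id
  FROM big_table t
 WHERE NOT EXISTS (SELECT 1 FROM replicated_ids r WHERE r.id = t.id)
 LIMIT 2000;

-- reset everything to "not replicated" almost instantly:
TRUNCATE TABLE replicated_ids;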

answered 2008-09-21T21:44:55

If you're trying to reset the whole table, not just a few rows, it is usually faster (on extremely large data sets, not on regular tables) to simply do a CREATE TABLE bar AS SELECT everything-except-the-copied-flag, 0 FROM foo, then swap the tables and drop the old one. Obviously you would need to make sure nothing is inserted into the original table while you are doing that. You'll have to recreate the index as well.

Edit: a simple improvement to avoid locking the table while you copy 14 GB:

lock;
create a new table, bar;
swap tables so that all writes go to bar;
unlock;
create table baz as select * from foo;
drop foo;
create the index on baz;
lock;
insert into baz select * from bar;
swap tables;
unlock;
drop bar;

(This lets writes happen while you do the copy, and inserts them afterwards.)
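
A concrete SQL rendering of the simple (non-improved) variant, assuming a table foo whose flag column is called replicated; the column names are illustrative:

BEGIN;
LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;   -- block writers during the swap
CREATE TABLE foo_new AS
    SELECT id, payload, 0 AS replicated    -- all columns, flag forced to 0
      FROM foo;
DROP TABLE foo;
ALTER TABLE foo_new RENAME TO foo;
CREATE INDEX replication_flag ON foo USING btree (id) WHERE replicated = 0;
COMMIT;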

answered 2008-09-21T22:07:42

While you probably can't fix the space-usage problem (it is temporary, lasting until a vacuum), you really can speed this up in terms of wall-clock time. The fact that PostgreSQL uses MVCC means you should be able to do this without any problems related to newly inserted rows. CREATE TABLE AS SELECT will get around some of the performance issues, but does not allow continued use of the table, and takes just as much space. Just ditch the index and rebuild it, then vacuum:

drop index replication_flag;
update big_table set replicated=0;
create index replication_flag on big_table using btree (id) where replicated = 0;
vacuum full analyze big_table;
answered 2008-09-22T18:46:14

Here is pseudocode. You will need a 400 MB (for int) or 800 MB (for bigint) temporary file (you can compress it with zlib if that is a problem). It needs about 100 scans over the table for vacuums. But it will not bloat the table by more than 1% (at most 1,000,000 dead rows at any time). You can also trade fewer scans for more table bloat.

// write all ids to temporary file in disk order                
// no where clause will ensure disk order
$file = tmpfile();
for $id, $replicated in query("select id, replicated from table") {
        if ( $replicated<>0 ) {
                write($file,&$id,sizeof($id));
        }
}

// prepare an update query; PostgreSQL prepared statements use $1,
// not ?, as the parameter placeholder
query("prepare set_replicated_0(bigint) as
        update table set replicated=0 where id=$1");

// reread this file, run the prepared query for each id, and commit
// and vacuum the table every 1000000 updates
rewind($file);
$counter = 0;
query("start transaction");
while read($file,&$id,sizeof($id)) {
        query("execute set_replicated_0($id)");
        $counter++;
        if ( $counter % 1000000 == 0 ) {
                query("commit");
                query("vacuum table");
                query("start transaction");
        }
}
query("commit");
query("vacuum table");
close($file);
answered 2008-09-22T09:36:56

I think what you need to do is the following (sketched in SQL after this list):

a. Copy the PK values of 2000 records into a temporary table, using the same selection criteria and LIMIT as usual.
b. Select the same 2000 records and perform the necessary operations in a cursor, as before.
c. If successful, run a single UPDATE query against the records in the temporary table. Then clear the temporary table and run step a again.
d. If unsuccessful, clear the temporary table without running the UPDATE query.

Simple, efficient and reliable. Regards, KT
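
A sketch of one iteration of that loop, driven by an outer program; the table and column names are assumptions:

BEGIN;
CREATE TEMP TABLE batch_ids AS
    SELECT id FROM big_table WHERE replicated <> 0 LIMIT 2000;
-- ... step b: open a cursor over these 2000 rows and do the work ...
UPDATE big_table SET replicated = 0
 WHERE id IN (SELECT id FROM batch_ids);
DROP TABLE batch_ids;
COMMIT;  -- on any failure, ROLLBACK discards both the temp table and the update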

answered 2010-08-14T04:16:53

I think it would be better to upgrade your Postgres to version 8.X. The cause is probably the old version of Postgres. Also try the query below. I hope this helps.

UPDATE table1 SET name = table2.value
FROM table2 
WHERE table1.id = table2.id;
answered 2010-03-03T09:18:32