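Vertica allows duplicate rows to be inserted into a table. I can find them with the analyze_constraints function. How do I delete the duplicate rows from a Vertica table?
6 Answers
You should avoid, or at least limit, using DELETE on a large number of records. The following approach should be more efficient:
Step 1: Create a new table with the same structure and projections as the table that contains the duplicates: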
create table mytable_new like mytable including projections ;
Step 2: Insert the deduplicated rows into the new table:
insert /*+ direct */ into mytable_new select <column list> from (
select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name>
) a where a.rownum = 1 ;
Step 3: Rename the original table (the one that contains the dups):
alter table mytable rename to mytable_orig ;
Step 4: Rename the new table:
alter table mytable_new rename to mytable ;
That's it.
This is off the top of my head and not a great answer, so treat it as a last resort: you can delete both rows and re-insert one.
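A minimal sketch of that idea for a single duplicated key. It assumes a hypothetical id column whose value 42 identifies the duplicated rows, and that the copies are exact duplicates, so select distinct collapses them to one row. The scratch table name dup_keep is also made up for illustration:

```sql
-- Hypothetical names: mytable.id = 42 is the duplicated key, dup_keep is a scratch table.
-- Keep one copy of the duplicated row in a session-local temp table.
create local temporary table dup_keep on commit preserve rows as
select distinct * from mytable where id = 42;

-- Delete every copy from the main table, then put the single kept copy back.
delete from mytable where id = 42;
insert into mytable select * from dup_keep;
commit;

drop table dup_keep;
```

This only makes sense for a handful of keys; for many duplicates the table-swap approach avoids the DELETE entirely.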
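Mauro's answer is correct, but there is an error in the SQL of step 2. The complete working procedure that avoids DELETE is as follows:
Step 1: Create a new table with the same structure and projections as the table that contains the duplicates: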
create table mytable_new like mytable including projections ;
Step 2: Insert the deduplicated rows into the new table:
insert /*+ direct */ into mytable_new select <column list> from (
select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
) a where a.rownum = 1 ;
Step 3: Rename the original table (the one that contains the dups):
alter table mytable rename to mytable_orig ;
Step 4: Rename the new table:
alter table mytable_new rename to mytable ;
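Before dropping mytable_orig, it may be worth checking that the swap did what you expected. A sketch using the same placeholders as the steps above (<pk column list> still has to be filled in with your own key columns):

```sql
-- Should return zero rows if the deduplication worked.
select <pk column list>, count(*)
from mytable
group by <pk column list>
having count(*) > 1;

-- Row counts before and after, for comparison.
select count(*) from mytable_orig;
select count(*) from mytable;

-- Only once the checks pass: drop the original table to reclaim space.
drop table mytable_orig cascade;
```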
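You can remove duplicates from a Vertica table by creating a temporary table and generating pseudo row_ids. There are several steps, especially if you are removing duplicates from a very large and wide table. In the example below, I assume that the rows with keys k1 and k2 each have more than one duplicate. See here for more information.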
-- Step 1: Find the duplicates
select key, count(1) from large_table_1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc ;
-- Step 2: Dump the duplicates into temp table
create table test.large_table_1_dups
like large_table_1;
alter table test.large_table_1_dups -- add row_num column (pseudo row_id)
add column row_num int;
insert into test.large_table_1_dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large_table_1
where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact dups
-- Step 3: Remove duplicates from the temp table
delete from test.large_table_1_dups
where row_num > 1;
-- Sanity test. Should return exactly 1 row each for k1 and k2
select * from test.large_table_1_dups;
-- Step 4: Delete all duplicates from main table...
delete from large_table_1
where key in ('k1', 'k2');
-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large_table_1_dups
drop column row_num;
insert into large_table_1
select * from test.large_table_1_dups;
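Step 1: Create an intermediate table and load it with the data from the original table plus a row number. In the example below, the data is copied from Table1 into Table2 together with a row_num column.
select * into Table2 from (select *, ROW_NUMBER() OVER(PARTITION BY A, B order by C) as row_num from Table1) A;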
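Step 2: Use Table2, created in the previous step, to delete the duplicated rows from Table1. Table1 has no row_num column, so this removes every copy of each duplicated (A, B) pair, including the one you want to keep:
DELETE FROM Table1 WHERE EXISTS (SELECT NULL FROM Table2
where Table2.A=Table1.A
and Table2.B=Table1.B
and row_num > 1);
Step 3: Re-insert one copy of each deleted pair from Table2 (list all of Table1's columns here; A, B and C are the only ones shown in this example):
INSERT INTO Table1 (A, B, C)
SELECT t.A, t.B, t.C FROM Table2 t
where t.row_num = 1
and EXISTS (SELECT NULL FROM Table2 d where d.A=t.A and d.B=t.B and d.row_num > 1);
Step 4: Drop the table created in the first step, i.e. Table2
Drop Table Table2;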
You should take a look at this answer from the PostgreSQL wiki, which also works for Vertica:
DELETE
FROM
tablename
WHERE
id IN(
SELECT
id
FROM
(
SELECT
id,
ROW_NUMBER() OVER(
partition BY column1,
column2,
column3
ORDER BY
id
) AS rnum
FROM
tablename
) t
WHERE
t.rnum > 1
);
It deletes all duplicate entries except the one with the lowest id.
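To preview what that statement would remove before running it, the inner query can be run on its own. A sketch reusing the placeholder names from the answer above (tablename, id, column1 to column3):

```sql
-- Lists every row the DELETE above would remove (rnum > 1 within each group)
-- without modifying the table.
SELECT t.*
FROM (
    SELECT
        id, column1, column2, column3,
        ROW_NUMBER() OVER (
            PARTITION BY column1, column2, column3
            ORDER BY id
        ) AS rnum
    FROM tablename
) t
WHERE t.rnum > 1
ORDER BY column1, column2, column3, id;
```

Note that the DELETE relies on id being unique per row; if the duplicated rows share the same id, every copy would match the IN list and be deleted.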