sql - SQL query to efficiently select imperfect duplicates

Question

I have a database table in entity-attribute-value format, as shown below:

The radiology table

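I want to select all rows where the "entity" and "attribute" columns have the same values but the "value" column differs. Multiple rows with identical values in all three columns should be treated as a single row. My way of achieving this is to use SELECT DISTINCT.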

SELECT entity_id, attribute_name, COUNT(attribute_name) AS NumOcc 
FROM (SELECT DISTINCT * FROM radiology) x 
GROUP BY entity_id,attribute_name 
HAVING COUNT(attribute_name) > 1

Result of this query

However, I have read that SELECT DISTINCT is expensive. I plan to run this query on very large tables, so I am looking for a way to optimize it, perhaps without using SELECT DISTINCT.

I am using PostgreSQL 10.3.

score 1 · Accepted Answer

select  *
from    radiology r
join    (
        select  entity_id
        ,       attribute_name
        from    radiology
        group by
                entity_id
        ,       attribute_name
        having  count(distinct value) > 1
        ) dupe
 on     r.entity_id = dupe.entity_id
        and r.attribute_name = dupe.attribute_name

score 0 · Accepted Answer

This should work for you:

select a.* from radiology a join 
(select entity, attribute, count(distinct value) cnt
from radiology 
group by entity, attribute
having count(distinct value)>1)b
on a.entity=b.entity and a.attribute=b.attribute

score 0 · Accepted Answer

I want to select all rows where the "entity" and "attribute" columns have the same values but the "value" column differs.

Your approach does not do that. I would think of exists:

select r.*
from radiology r
where exists (select 1
              from radiology r2
              where r2.entity = r.entity and r2.attribute = r.attribute and
                    r2.value <> r.value
             );
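On a very large table, the correlated subquery above generally benefits from an index that covers the columns it compares. The statement below is only a sketch: the index name is hypothetical, and the column names are assumed to match the ones used in the query above. Whether the planner actually uses the index depends on the data, so it is worth checking with explain.

```sql
-- Hypothetical index name; columns assumed to match the exists query above.
-- The two equality columns come first, followed by the column compared with <>.
create index radiology_entity_attribute_value_idx
    on radiology (entity, attribute, value);
```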

If you only want the entity/attribute pairs that have differing values, use group by:

select entity, attribute
from radiology
group by entity, attribute
having min(value) <> max(value);

Note that you could use having count(distinct value) > 1, but count(distinct) incurs more overhead than min() and max().
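One way to check that claim on your own data is to run both variants under explain analyze and compare the reported plans and timings. The following is a minimal sketch: the table radiology_demo, its column types, and the sample rows are all assumptions made for illustration, with column names taken from the query in the question.

```sql
-- Hypothetical demo table; column types and sample rows are assumptions.
create temporary table radiology_demo (
    entity_id      integer,
    attribute_name text,
    value          text
);

insert into radiology_demo (entity_id, attribute_name, value) values
    (1, 'modality', 'CT'),
    (1, 'modality', 'CT'),   -- exact duplicate of the row above
    (1, 'modality', 'MRI'),  -- same entity and attribute, different value
    (2, 'modality', 'CT'),
    (2, 'region',   'chest');

-- Variant 1: count(distinct value)
explain analyze
select entity_id, attribute_name
from radiology_demo
group by entity_id, attribute_name
having count(distinct value) > 1;

-- Variant 2: min() and max()
explain analyze
select entity_id, attribute_name
from radiology_demo
group by entity_id, attribute_name
having min(value) <> max(value);
```

Run without explain analyze, both variants select the same single pair for this sample data (entity_id 1, attribute_name 'modality'), because exact duplicate rows do not add a second distinct value. Timings on five rows are not meaningful; the comparison only becomes informative on a table of realistic size.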
