I have about 34 million rows, each with 23 columns, in the store_sales table of the TPC-DS dataset.

I have a composite primary key on ss_item_sk and ss_ticket_number.

After running the query SELECT count(DISTINCT <primary key>) .., I can see that it outputs the total number of rows present in the table.

Now I add another column alongside the primary key, namely ss_sold_date_sk.

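After this, if I run the count query, I get a smaller number of rows printed than before. Can someone explain to me, with an example, why this could happen?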

TL;DR

When does adding a column to a composite primary key stop it from being unique?

1 Answer

Demo

create table mytable (c1 string,c2 string);
insert into mytable values ('A','A'),('B',null),('C',null);

select count(distinct c1) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+

As expected: 3 distinct values, 'A', 'B' and 'C'.


select count(distinct concat(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+
    

As expected. Why? See the next query.


select c1,c2,concat(c1,c2) as concat_c1_c2 from mytable;

+----+------+--------------+
| c1 |  c2  | concat_c1_c2 |
+----+------+--------------+
| A  | A    | AA           |
| B  | NULL | NULL         |
| C  | NULL | NULL         |
+----+------+--------------+

Concatenation with NULL yields NULL, so only 'AA' is left to be counted.


select count(distinct c1,c2) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+

Bug!! The rows that have a NULL in either column are not counted.


Here is a workaround for the bug:

select count(distinct struct(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+
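
To tie this back to the question: if ss_sold_date_sk contains NULLs (the demo suggests this is the likely cause, but it is an assumption about your data), the rows with a NULL there drop out of the count. Below is a minimal sketch that replays the demo with Python's built-in sqlite3 module standing in for Hive. SQLite is only a stand-in: it has no struct() and does not accept count(distinct c1,c2), so the sketch shows the concatenation case, a portable alternative that de-duplicates in a subquery, and a NULL check you could adapt to store_sales.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption); it shares the rule that
# concatenating with NULL yields NULL, but it has no struct() function.
conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (c1 text, c2 text)")
conn.execute("insert into mytable values ('A','A'),('B',null),('C',null)")

# Concatenation: a NULL in c2 turns the whole key into NULL, which count() skips.
concat_count = conn.execute(
    "select count(distinct c1 || c2) from mytable"
).fetchone()[0]

# Portable alternative to struct(): de-duplicate the rows first, then count them.
row_count = conn.execute(
    "select count(*) from (select distinct c1, c2 from mytable)"
).fetchone()[0]

# How many rows carry a NULL in the added column?
null_count = conn.execute(
    "select count(*) from mytable where c2 is null"
).fetchone()[0]

print(concat_count, row_count, null_count)  # prints: 1 3 2
```

On the real table, the equivalent of the last query would be a count of rows where ss_sold_date_sk is null; if that number matches the gap between the two counts you observed, NULLs in that column are the explanation.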
answered 2017-02-23T22:59:39.760