2

I have a dataset in a table.

id  |   attribute
-----------------
1   |   a
2   |   b
2   |   a
2   |   a
3   |   c

Desired output:

attribute|  num
-------------------
a        |  1
b,a      |  1
c        |  1

In MySQL, I would use:

select attribute, count(*) num
from
   (select id, group_concat(distinct attribute) attribute
    from dataset
    group by id) as subquery
group by attribute;
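
To check the expected output locally, here is a minimal sketch in Python that runs the same grouping against an in-memory SQLite database. SQLite is only a stand-in here (an assumption of this sketch, not part of the question): its group_concat accepts DISTINCT like MySQL's, but the order of the concatenated values is not guaranteed, so the sketch sorts them before printing.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER, attribute TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "a"), (2, "a"), (3, "c")],
)

rows = conn.execute(
    """
    SELECT attribute, COUNT(*) AS num
    FROM (
        SELECT id, group_concat(DISTINCT attribute) AS attribute
        FROM dataset
        GROUP BY id
    ) AS subquery
    GROUP BY attribute
    """
).fetchall()

# group_concat does not guarantee element order, so normalize each attribute set.
result = sorted((tuple(sorted(attrs.split(","))), num) for attrs, num in rows)
print(result)
```

Each distinct attribute set (a; a and b; c) belongs to exactly one id, so every count is 1, matching the desired output above.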

I'm not sure this can be done in Redshift, because it supports neither group_concat nor any of the PostgreSQL group aggregate functions such as array_agg() or string_agg(). See this question.

Another workable solution would be a way to pick one random attribute from each group instead of using group_concat. How can this be done in Redshift?

4 Answers

2

I found a way to pick a random attribute for each id, but it is rather tricky. Honestly, I don't think it is a good approach, but it does work.

SQL:

-- (1) uniq dataset
WITH uniq_dataset AS (
  SELECT id, attr FROM dataset GROUP BY id, attr
)
SELECT
  uds.id, rds.attr
FROM
-- (2) generate random rank for each id
  (SELECT
     id,
     -- floor(random() * n) + 1 picks each rank 1..n with equal probability
     floor(random() * (SELECT count(*) FROM uniq_dataset iuds WHERE iuds.id = ouds.id)) + 1 AS random_rk
   FROM (SELECT DISTINCT id FROM uniq_dataset) ouds) uds,
-- (3) rank table
  (SELECT rank() OVER (PARTITION BY id ORDER BY attr) AS rk, id, attr FROM uniq_dataset) rds
WHERE
  uds.id = rds.id
AND
  uds.random_rk = rds.rk
ORDER BY
  uds.id;

Result:

 id | attr
----+------
  1 | a
  2 | a
  3 | c

OR

 id | attr
----+------
  1 | a
  2 | b
  3 | c

These are the tables used in this SQL.

-- dataset (original table)
 id | attr
----+------
  1 | a
  2 | b
  2 | a
  2 | a
  3 | c

-- (1) uniq dataset
 id | attr
----+------
  1 | a
  2 | a
  2 | b
  3 | c

-- (2) generate random rank for each id
 id | random_rk
----+----
  1 |  1
  2 |  1 <- 1 or 2
  3 |  1

-- (3) rank table
 rk | id | attr
----+----+------
  1 |  1 | a
  1 |  2 | a
  2 |  2 | b
  1 |  3 | c

answered 2013-11-27T08:26:15.703
0

This solution, inspired by Masashi's, is simpler and picks a random element from each group in Redshift.

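SELECT id, attribute
FROM (SELECT id,
             -- with the frame spanning the whole partition, every row of an id
             -- gets the same randomly chosen attribute
             FIRST_VALUE(attribute)
               OVER (PARTITION BY id ORDER BY random()
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS attribute
      FROM dataset) AS picked
GROUP BY id, attribute
ORDER BY id;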
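
The same window-function idea can be tried locally. The following is a minimal sketch in Python against an in-memory SQLite database, which is an assumption of this sketch rather than part of the answer; it needs SQLite 3.25 or newer for window functions, and the alias `picked` is purely illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER, attribute TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "a"), (2, "a"), (3, "c")],
)

picked = conn.execute(
    """
    SELECT id, attribute
    FROM (
        SELECT id,
               FIRST_VALUE(attribute) OVER (
                   PARTITION BY id
                   ORDER BY random()
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS attribute
        FROM dataset
    ) AS picked
    GROUP BY id, attribute
    ORDER BY id
    """
).fetchall()

# One row per id; the attribute for id 2 is either 'a' or 'b', varying between runs.
print(picked)
```

Because the raw rows are shuffled, duplicates weigh in: id 2 has two 'a' rows and one 'b' row, so 'a' is the more likely pick.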

answered 2014-01-19T04:18:18.417
0

This is an answer to a related question here. That question has been closed, so I am posting the answer here.

Here is a way to aggregate a column into a string:

select * from temp;
 attribute 
-----------
 a
 c
 b

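1) Give each row a unique rank (rank() is only unique when the attribute values are distinct; row_number() avoids ties)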

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;

 attribute | rnk 
-----------+-----
 a         |   1
 b         |   2
 c         |   3

2) Use the concatenation operator || to merge the rows into one

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
       (select attribute from sub_table where rnk = 2)||
       (select attribute from sub_table where rnk = 3) res_string;

 res_string 
------------
 abc

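This only works for a bounded number of rows (X) in that column. They can be the first X rows, ordered by whatever attribute is given in the "order by" clause. I would guess this is expensive.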

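A CASE expression can be used to handle the NULL that appears when a given rank does not exist; without it, concatenating a NULL with || makes the whole result NULL.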

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
       (select attribute from sub_table where rnk = 2)||
       (select attribute from sub_table where rnk = 3)||
       (case when (select attribute from sub_table where rnk = 4) is NULL then '' 
             else (select attribute from sub_table where rnk = 4) end) as res_string;
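
The NULL behaviour, and a more compact fallback using COALESCE in place of the CASE expression, can be checked with a minimal sketch in Python. SQLite serves as a local stand-in here (an assumption; its || operator also returns NULL when either operand is NULL), and the table name `ranked` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ranked (attribute TEXT, rnk INTEGER)")
conn.executemany("INSERT INTO ranked VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)])

def pick(rnk):
    # Scalar subquery: yields NULL when no row has the requested rank.
    return f"(SELECT attribute FROM ranked WHERE rnk = {rnk})"

# Rank 4 does not exist, so the plain concatenation collapses to NULL.
plain = conn.execute(
    f"SELECT {pick(1)} || {pick(2)} || {pick(3)} || {pick(4)}"
).fetchone()[0]

# COALESCE replaces each missing rank with an empty string.
safe = conn.execute(
    "SELECT " + " || ".join(f"COALESCE({pick(r)}, '')" for r in range(1, 5))
).fetchone()[0]

print(plain, safe)
```

Wrapping every rank in COALESCE, not just the last one, keeps the result intact even when the column has fewer rows than expected.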

answered 2014-02-15T02:19:13.677
-2

I haven't tested this query, but Redshift supports these functions:

select id, array_to_string(array(select attribute from mydataset m where m.id=d.id),',') from mydataset d

answered 2014-01-17T04:01:30.597