sql - How to aggregate information from an indefinite number of groups

Question

How can I aggregate information from an indefinite number of groups in T-SQL? For example, we have a table with 2 columns: clients and regions.

Clients Regions
client1 45
client1 45
client1 45
client1 45
client1 43
client1 42
client1 41
client2 45
client2 45
client3 43
client3 43
client3 41
client3 41
client3 41
client3 41

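Each client can have any number of regions.

In the example above, client1 has 4 distinct regions, client2 has 1, and client3 has 2.

I want to calculate the Gini impurity for each client, i.e. measure how diverse a client's regions are.

To do this, I want to apply the following formula to each client: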

1 - ((% of region1 among all the regions in the client) ^ 2 + 
     (% of region2 among all the regions in the client) ^ 2 + 
   … (% of regionN among all the regions in the client) ^ 2)

But the number of regions is indefinite (it can differ from client to client).

For the data above, this should compute:

client1 = 1 - ((4 / 7 ) ^ 2 + (1 / 7 ) ^ 2 + (1 / 7 ) ^ 2  + (1 / 7 ) ^ 2)
client2 = 1 - ((2 / 2 ) ^ 2)
client3 = 1 - ((2 / 6 ) ^ 2 +  (4 / 6 ) ^ 2)

This is the desired output:

Clients Impurity
client1 61%
client2 0%
client3 44%
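
As a check on these figures, the three expressions can be evaluated by hand. The symbol G below is only a shorthand introduced here for a client's Gini impurity; it does not appear in the original formula.

```latex
\begin{align*}
G_{\text{client1}} &= 1 - \left[\left(\tfrac{4}{7}\right)^2 + 3\left(\tfrac{1}{7}\right)^2\right]
                    = 1 - \tfrac{19}{49} = \tfrac{30}{49} \approx 0.612 \\
G_{\text{client2}} &= 1 - \left(\tfrac{2}{2}\right)^2 = 0 \\
G_{\text{client3}} &= 1 - \left[\left(\tfrac{2}{6}\right)^2 + \left(\tfrac{4}{6}\right)^2\right]
                    = 1 - \tfrac{20}{36} = \tfrac{4}{9} \approx 0.444
\end{align*}
```

Rounded to whole percentages, these give the 61%, 0% and 44% listed above.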

Could you suggest an approach to solving this?

score 4 · Accepted Answer

I think the formula can be expressed in a set-based way like this:

WITH cte AS (
    SELECT Clients
         , CAST(COUNT(*) AS DECIMAL(10, 0)) / SUM(COUNT(*)) OVER(PARTITION BY Clients) AS tmp
    FROM t
    GROUP BY Clients, Regions
)
SELECT Clients
     , 100 * (1 - SUM(tmp * tmp)) AS GI
FROM cte
GROUP BY Clients

The db<>fiddle result appears to match the expected output.
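
To try the query outside the fiddle, a minimal setup along these lines should be enough. The table name t and the column names come from the query above; the column types are assumptions, since the question does not state them.

```sql
-- Hypothetical setup for the sample data in the question.
-- Column types are assumed; adjust them to match the real table.
CREATE TABLE t (
    Clients varchar(20) NOT NULL,
    Regions int         NOT NULL
);

INSERT INTO t (Clients, Regions) VALUES
    ('client1', 45), ('client1', 45), ('client1', 45), ('client1', 45),
    ('client1', 43), ('client1', 42), ('client1', 41),
    ('client2', 45), ('client2', 45),
    ('client3', 43), ('client3', 43),
    ('client3', 41), ('client3', 41), ('client3', 41), ('client3', 41);
```

Against this data the query should return a GI of roughly 61.2 for client1, 0 for client2 and 44.4 for client3; wrap the expression in ROUND or CAST if whole percentages are wanted.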

score 1 · Accepted Answer

Here's how I would approach it:

In the innermost subquery, do a count(*) as cnt ... group by clients, regions
In the middle subquery, do a cast(cnt as float)/sum(cnt) over(partition by clients) as pcnt and square it
In the outer query, do a 1 - sum(pcnt) ... group by clients
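
The three steps might be sketched as nested queries like this. It assumes the table is called t with columns Clients and Regions, as in the other answer; the aliases counts, shares and pcnt_sq are names made up for illustration.

```sql
-- Sketch only: table t(Clients, Regions) is assumed, aliases are illustrative.
SELECT Clients,
       1 - SUM(pcnt_sq) AS impurity   -- step 3: one minus the sum of squared shares
FROM (
    SELECT Clients,
           -- step 2: each region's share within its client, squared
           POWER(CAST(cnt AS float) / SUM(cnt) OVER (PARTITION BY Clients), 2) AS pcnt_sq
    FROM (
        -- step 1: number of rows per (client, region) pair
        SELECT Clients, Regions, COUNT(*) AS cnt
        FROM t
        GROUP BY Clients, Regions
    ) AS counts
) AS shares
GROUP BY Clients;
```

This returns a ratio between 0 and 1; multiply the outer expression by 100 to get a percentage.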

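There are ways to compress this so that it doesn't use 2 subqueries, but they probably won't make it any more readable or easier to understand. I'm not entirely clear whether you want a percentage (out of 100) or a ratio (out of 1), so you may need to add a *100 in the appropriate place.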
