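Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests) and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id from the map, with a probability proportional to its relative weight.
Note the emphasis on relative weights, not percentages. As Branko points out in his answer, using relative weights works for anything, including percentages.
Now, consider some test data, which we'll put in a temporary table: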
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
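Note that I'm using a more complicated example than the one in the original question: the weights do not conveniently add up to 100, and the same weight (10) is used more than once (for ids 2 and 3). The latter is important, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weight)). A running sum of those probabilities then gives each id its own start and end probability: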
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This results in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
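For anyone who finds the interval construction easier to follow outside SQL, here is a rough Python sketch of the same normalization and running sum. It is only an illustration, not part of the query: the helper name `probability_intervals` is made up for this example, and it assumes the rows are ordered by descending weight, as in the `cp` CTE.

```python
from fractions import Fraction


def probability_intervals(weights):
    """Map {id: weight} to {id: (start, end)}, mirroring the query above.

    Hypothetical helper for illustration. Exact fractions are used so the
    interval bounds are not affected by floating-point rounding.
    """
    total = sum(weights.values())
    # Same ordering as the cp CTE: highest probability first. Python's sort
    # is stable, so ids with equal weights keep their input order.
    ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    intervals = {}
    cumulative = Fraction(0)
    for id_, weight in ordered:
        probability = Fraction(weight, total)
        intervals[id_] = (cumulative, cumulative + probability)
        cumulative += probability
    return intervals


intervals = probability_intervals({1: 25, 2: 10, 3: 10, 4: 5})
for id_, (start, end) in intervals.items():
    print(id_, float(start), float(end))
```

With the test weights, this should print the same start and end values as the table above.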
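Admittedly, the query above does more work than strictly necessary, but I find it helpful to visualize the relative probabilities this way, and it makes the final step of choosing the id trivial: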
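SELECT q.id
FROM (queryabove) AS q
CROSS JOIN (SELECT random()) AS r(val) -- draw once; random() in the WHERE clause is re-evaluated for every row
WHERE r.val BETWEEN q.startprobability AND q.endprobability;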
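The lookup itself amounts to drawing one random number and finding the interval that contains it. A minimal Python sketch of that step follows, assuming the end probabilities from the output table above; the names `IDS`, `END_PROBABILITIES` and `pick_id` are hypothetical. Unlike `BETWEEN`, which includes both bounds, the sketch treats each interval as half-open so that a draw landing exactly on a boundary belongs to one id only.

```python
import bisect
import random

# Ids and the end of each id's interval, copied from the output table above.
IDS = [1, 2, 3, 4]
END_PROBABILITIES = [0.5, 0.7, 0.9, 1.0]


def pick_id(draw):
    """Return the id whose interval [start, end) contains draw (0 <= draw < 1)."""
    # bisect_right returns the index of the first end probability that is
    # strictly greater than draw, so a draw exactly on a boundary goes to
    # the next id.
    return IDS[bisect.bisect_right(END_PROBABILITIES, draw)]


# One draw, compared against all intervals.
print(pick_id(random.random()))
```

Because the end probabilities are sorted, a binary search is enough here; the SQL version gets the same effect by comparing the single draw against every row.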
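Now, let's put it all together with a test that makes sure the query returns data with the expected distribution. We'll use generate_series() to generate a random number a million times: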
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT fp.id, count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count DESC
;
This results in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
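As you can see, it tracks the expected distribution closely: every count is within 0.1 percentage points of the expected 50/20/20/10 split.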
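The same sanity check can be sketched in Python, which is handy when no database is at hand. This is an illustration under assumptions, not the query itself: `simulate` is a made-up name, the seed is arbitrary, and it uses 100,000 draws rather than a million to keep it quick.

```python
import random
from collections import Counter


def simulate(weights, draws, seed=42):
    """Count how often each id is picked over a number of weighted draws."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    total = sum(weights.values())
    counts = Counter()
    for _ in range(draws):
        draw = rng.random()  # uniform in [0, 1)
        cumulative = 0
        for id_, weight in weights.items():
            cumulative += weight
            # cumulative / total is the end probability of this id's interval;
            # for the last id it is exactly 1.0, so every draw finds an id.
            if draw < cumulative / total:
                counts[id_] += 1
                break
    return counts


counts = simulate({1: 25, 2: 10, 3: 10, 4: 5}, draws=100_000)
for id_, count in counts.most_common():
    print(id_, count)
```

The exact counts depend on the seed, but the shares should land near 50%, 20%, 20% and 10%.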
Performance
The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
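Adapting for test data generation
When generating test data for unit/integration tests, I often use a variation of the query above. The idea is to generate random data that approximates a probability distribution that tracks reality.
In such cases, I find it useful to compute the start and end probabilities once and store the results in a table: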
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
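I can then reuse these precomputed probabilities, which gives me extra performance and simpler usage.
I can even wrap it all in a function that I can call at any time to get a random id: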
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT;
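Window function frame
It's worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is needed to handle the fact that some weights may be repeated, which is why I chose test data with a repeated weight in the first place! With the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), rows that tie on the ORDER BY value are peers and all receive the same cumulative sum, so ids 2 and 3 would end up with the same interval.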
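To make the difference between the two frames concrete, here is a small Python emulation of both running sums over the example probabilities. It is a sketch of the frame semantics only; the names `PROBABILITIES`, `running_sum_rows` and `running_sum_range` are invented for this illustration.

```python
from fractions import Fraction

# (id, probability) pairs from the example, ordered descending as in the cp CTE.
PROBABILITIES = [
    (1, Fraction(1, 2)),
    (2, Fraction(1, 5)),
    (3, Fraction(1, 5)),
    (4, Fraction(1, 10)),
]


def running_sum_rows(rows):
    """Emulate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    result = {}
    total = Fraction(0)
    for id_, probability in rows:
        total += probability  # frame ends at the current physical row
        result[id_] = total
    return result


def running_sum_range(rows):
    """Emulate the default RANGE frame, which also includes tied rows (peers)."""
    result = {}
    for id_, probability in rows:
        # With ORDER BY probability DESC, the frame holds every row whose
        # probability is greater than or equal to the current row's.
        result[id_] = sum(p for _, p in rows if p >= probability)
    return result


print({k: float(v) for k, v in running_sum_rows(PROBABILITIES).items()})
print({k: float(v) for k, v in running_sum_range(PROBABILITIES).items()})
```

With the RANGE emulation, ids 2 and 3 both get a cumulative probability of 0.9, so after subtracting their own probability they would share the interval from 0.7 to 0.9 and nothing would cover 0.5 to 0.7. The ROWS emulation gives 0.7 and 0.9, matching the table earlier in this answer.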