26

如何根据所有候选行的应用权重随机选择 T-SQL 中的表行?

例如,我在一个表中有一组行,权重分别为 50、25 和 25(加起来为 100 但不需要),我想随机选择其中一个,其统计结果相当于相应的重量。

4

5 回答 5

18

Dane 的回答包括以引入平方律的方式进行自连接。(n*n/2)连接后的行,表中有 n 行。

更理想的是能够只解析表一次。

DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1))

SELECT
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END,
    @weight_point = @weight_point - [table].weight
FROM
    @table [table]
ORDER BY
    [table].Weight DESC

这将通过表格,设置@id为每条记录的id值,同时递减@weight点。最终,@weight_point意志变负。这意味着SUM所有前面的权重都大于随机选择的目标值。这是我们想要的记录,所以从那时起我们设置@id为它自己(忽略表中的任何 ID)。

这只遍历表一次,但即使选择的值是第一条记录,也必须遍历整个表。因为平均位置是表格的一半(如果按权重排序则更少)编写循环可能会更快......(特别是如果权重在公共组中):

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0)

SELECT @next_weight = MAX(weight) FROM @table
SELECT @row_count   = COUNT(*)    FROM @table WHERE weight = @next_weight
SET @weight_point = @weight_point - (@next_weight * @row_count)

WHILE (@weight_point > 0)
BEGIN
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight
    SELECT @row_count   = COUNT(*)    FROM @table WHERE weight = @next_weight
    SET @weight_point = @weight_point - (@next_weight * @row_count)
END

-- # Once the @weight_point is less than 0, we know that the randomly chosen record
-- # is in the group of records WHERE [table].weight = @next_weight

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1))

SELECT
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END,
    @row_count = @row_count - 1
FROM
    @table [table]
WHERE
    [table].weight = @next_weight
ORDER BY
    [table].Weight DESC
于 2009-01-18T01:26:24.850 回答
8

您只需将所有候选行的权重相加,然后在该总和中选择一个随机点,然后选择与该所选点协调的记录(每条记录都递增地携带一个累积的权重总和)。

DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0)

SELECT TOP 1 @id = t1.id
FROM @table t1, @table t2
WHERE t1.id >= t2.id
GROUP BY t1.id
HAVING SUM(t2.weight) >= @weight_point
ORDER BY t1.id

SELECT @id
于 2008-09-12T07:45:06.160 回答
4

如果您有很多记录,那么“递增地携带一个累积 [原文如此] 重量总和”部分会很昂贵如果您也已经有广泛的分数/权重(即:范围足够宽,大多数记录权重都是唯一的。1-5 星可能不会削减它),您可以执行类似的操作来选择权重值. 我在这里使用 VB.Net 进行演示,但这也可以在纯 Sql 中轻松完成:

Function PickScore()
    'Assume we have a database wrapper class instance called SQL and seeded a PRNG already
    'Get count of scores in database
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]")
    ' You could also approximate this with just the number of records in the table, which might be faster.

    'Random number between 0 and 1 with ScoreCount possible values
    Dim rand As Double = Random.GetNext(ScoreCount) / ScoreCount

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1
    rand = 1 - (rand * rand * rand)

    'Now we need to map the (0,1] vector to [1,Maxscore].
    'Just find MaxScore and mutliply by rand
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs")
    Return MaxScore * rand
End Function

运行这个,并选择最大分数小于返回权重的记录。如果不止一条记录共享该分数,请随机选择它。这里的优点是您不必保留任何总和,并且您可以调整用于适合您的口味的概率方程。但同样,它在更大的分数分布情况下效果最好。

于 2008-09-12T13:41:07.040 回答
3

使用随机数生成器执行此操作的方法是集成概率密度函数。使用一组离散值,您可以计算前缀总和(直到这个值的所有值的总和)并存储它。使用此选项,您可以选择大于随机数的最小前缀总和(累计到日期)值。

在数据库上,必须更新插入后的后续值。如果更新的相对频率和数据集的大小不会使执行此操作的成本过高,则意味着可以从单个 s-argable(可以通过索引查找解决的谓词)查询中获得适当的值.

于 2008-09-18T14:57:55.477 回答
1

如果您需要获取一组样本(例如,您想从 5M 行的集合中抽取 50 行),其中每行都有一个名为Weightwhich的列,int并且较大的值意味着更多的权重,您可以使用此函数:

SELECT * 
FROM 
(
    SELECT TOP 50 RowData, Weight 
    FROM MyTable 
    ORDER BY POWER(RAND(CAST(NEWID() AS VARBINARY)), (1.0/Weight)) DESC
) X 
ORDER BY Weight DESC

这里的关键是使用 POWER( ) 函数,如此处所示

关于选择随机函数的参考在这里这里

或者,您可以使用:

1.0 * ABS(CAST(CHECKSUM(NEWID()) AS bigint)) / CAST(0x7FFFFFFF AS INT) 

由于这个问题BIGINT,您将校验和转换为:INT

因为校验和返回一个 int,并且 int 的范围是 -2^31 (-2,147,483,648) 到 2^31-1 (2,147,483,647),所以如果结果恰好是 -2,147,483,648,abs() 函数会返回溢出错误!机会显然非常低,大约 40 亿分之一,但是我们每天在 ~1.8b 行表上运行它,所以它大约每周发生一次!修复是在 abs 之前将校验和转换为 bigint。

于 2018-06-28T19:38:57.770 回答