postgresql - Weighted random selection

Question

I have two tables holding the most common first names and last names. Each table has basically two fields:

Tables

CREATE TABLE "common_first_name" (
    "first_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

CREATE TABLE "common_last_name" (
    "last_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

PS: The top name occurs only ~1.8% of the time. Each table has 1000 rows.

Function (pseudocode, not ready)

CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records INT)
    RETURNS VOID
    AS $$
DECLARE
    SUM_OF_WEIGHTS CONSTANT INT := 100;
BEGIN

    FOR i IN 1..coalesce(p_number_of_records, 0) LOOP
      -- Get a random first and last name, taking their probability (ratio) into account:
      -- round(random()*SUM_OF_WEIGHTS);
      --create_person (random_first_name || ' ' || random_last_name);
    END LOOP;
END
$$
LANGUAGE plpgsql VOLATILE;

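PS: Within each table, the ratios of all names sum to 100%.

I want to run a function N times, picking a first name and a last name each time, to create sample data. Both tables have 1000 rows each.

The sample size could be anywhere from 1000 to 1,000,000 full names, so a "fast" way to do this weighted random selection would be even better.

Any suggestions on how to do this in PL/pgSQL?

I am using PG 13.3 on SUPABASE.IO.

Thanks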

score 2 · Accepted Answer

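Given the small size of the input data set, this is straightforward to do in pure SQL. Use CTEs to build lower-bound and upper-bound columns for every row in each of the common_FOO_name tables, then use generate_series() to produce a set of random numbers. Join everything together, and use "the random value falls between the bounds" as the WHERE clause.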

with first_names_weighted as (
  select first_name,
    sum(ratio) over (order by first_name) - ratio as lower_bound,
    sum(ratio) over (order by first_name) as upper_bound
  from common_first_name
),
last_names_weighted as (
  select last_name,
    sum(ratio) over (order by last_name) - ratio as lower_bound,
    sum(ratio) over (order by last_name) as upper_bound
  from common_last_name
),
randoms as (
  select random() * (select sum(ratio) from common_first_name) as f_random,
         random() * (select sum(ratio) from common_last_name) as l_random
  from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random < f.upper_bound
  and l.lower_bound <= r.l_random and r.l_random < l.upper_bound;
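
To see how the running sum turns ratios into non-overlapping ranges, here is a minimal sketch using three made-up names in an inline VALUES list (the names and ratios are assumptions for illustration, not rows from the real tables):

```sql
-- Hypothetical three-row sample standing in for common_first_name.
with sample_names(first_name, ratio) as (
  values ('Ann', 50::numeric), ('Bob', 30), ('Cy', 20)
)
select first_name,
       -- running total minus this row's own ratio = start of its range
       sum(ratio) over (order by first_name) - ratio as lower_bound,
       -- running total including this row = end of its range
       sum(ratio) over (order by first_name) as upper_bound
from sample_names
order by first_name;
-- Ann:  0 .. 50
-- Bob: 50 .. 80
-- Cy:  80 .. 100
```

A random value in [0, 100) lands in exactly one of these ranges, and wider ranges are hit proportionally more often. The running sum relies on the ORDER BY column being unique (here it is the primary key); rows with equal sort keys would be summed together as peers.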

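Change the value passed to generate_series() to control how many names are generated. If it matters that this be a function, you can use a LANGUAGE SQL function definition to parameterize that number: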

https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3
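
In case the fiddle link is unavailable, the following is a rough sketch of what such a wrapper could look like. The function name random_full_names and the parameter p_count are assumptions made for this example, not taken from the fiddle. The body is the same query, with the upper bound treated as exclusive so that a value landing exactly on a boundary matches only one row.

```sql
-- Hypothetical name and signature; the body reuses the query from the answer.
create or replace function random_full_names(p_count int)
returns table (first_name text, last_name text)
language sql
volatile  -- random() must be re-evaluated on every call
as $$
  with first_names_weighted as (
    select c.first_name,
      sum(c.ratio) over (order by c.first_name) - c.ratio as lower_bound,
      sum(c.ratio) over (order by c.first_name) as upper_bound
    from common_first_name c
  ),
  last_names_weighted as (
    select c.last_name,
      sum(c.ratio) over (order by c.last_name) - c.ratio as lower_bound,
      sum(c.ratio) over (order by c.last_name) as upper_bound
    from common_last_name c
  ),
  randoms as (
    -- one pair of random values per requested name; NULL is treated as 0
    select random() * (select sum(ratio) from common_first_name) as f_random,
           random() * (select sum(ratio) from common_last_name) as l_random
    from generate_series(1, coalesce(p_count, 0))
  )
  select f.first_name, l.last_name
  from randoms r
  cross join first_names_weighted f
  cross join last_names_weighted l
  where f.lower_bound <= r.f_random and r.f_random < f.upper_bound
    and l.lower_bound <= r.l_random and r.l_random < l.upper_bound
$$;
```

It could then be called as select * from random_full_names(1000); which should return one row per requested name.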
