
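To train a machine learning model, I have to retrieve a sample of users that contains a balanced number of current users and former users. The database consists of the tables all_users and former_users.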

The following query returns records with the required columns, but as an unbalanced sample (100 records):

SELECT t1.user_property1, t2.user_property2, t3.valid_to
FROM additional_info t1
LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
LEFT JOIN former_users t3 ON t1.user_ID = t3.user_ID
ORDER BY random()
LIMIT 100

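For a balanced sample, half of the records should be users stored in the table former_users, and the other half should be users from the table all_users who are not also in former_users (otherwise the sample would not be balanced).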

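Does anyone know the most convenient way to retrieve a balanced random sample from the tables all_users and former_users, together with the additional attributes from the table additional_info?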

Thanks!


2 Answers


One thing you might consider doing is:

Query 1 - SELECTS random non-former users joined to additional_info with a LIMIT of 50
Query 2 - SELECTS random former users joined to additional_info with a LIMIT of 50

Then combine the results with a UNION:

(Query 1) UNION (Query 2)

This gives you random results for both criteria, 100 users in total.
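The two-query idea above can be sketched as follows. This is a minimal, self-contained illustration rather than a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the row counts, property values, and the aliases `former_half` and `current_half` are made up for the example. Each half is wrapped in a subquery because SQLite does not accept parenthesized branches of a compound SELECT; the aliases are there because some databases require them. The sketch also uses UNION ALL instead of UNION, on the assumption that two users with identical column values should both stay in the sample (a plain UNION would also remove duplicate rows).

```python
import sqlite3

# Toy stand-in for the schema in the question. Table and column names follow
# the question; the data itself is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_users (user_ID INTEGER PRIMARY KEY, user_property2 TEXT);
    CREATE TABLE former_users (user_ID INTEGER PRIMARY KEY, valid_to TEXT);
    CREATE TABLE additional_info (user_ID INTEGER PRIMARY KEY, user_property1 TEXT);
""")
users = range(1, 301)
conn.executemany("INSERT INTO all_users VALUES (?, ?)",
                 [(u, f"p2_{u}") for u in users])
conn.executemany("INSERT INTO additional_info VALUES (?, ?)",
                 [(u, f"p1_{u}") for u in users])
# Users 1..80 are former users; the remaining 220 are current users.
conn.executemany("INSERT INTO former_users VALUES (?, ?)",
                 [(u, "2012-01-01") for u in range(1, 81)])

BALANCED_SAMPLE = """
    SELECT * FROM (
        -- Query 2 of the answer: random former users
        SELECT t1.user_property1, t2.user_property2, t3.valid_to
        FROM additional_info t1
        LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
        INNER JOIN former_users t3 ON t1.user_ID = t3.user_ID
        ORDER BY random() LIMIT :n
    ) AS former_half
    UNION ALL
    SELECT * FROM (
        -- Query 1 of the answer: random users that are not former users
        SELECT t1.user_property1, t2.user_property2, NULL AS valid_to
        FROM additional_info t1
        LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
        WHERE NOT EXISTS (
            SELECT 1 FROM former_users f WHERE f.user_ID = t1.user_ID
        )
        ORDER BY random() LIMIT :n
    ) AS current_half
"""

rows = conn.execute(BALANCED_SAMPLE, {"n": 50}).fetchall()
n_former = sum(1 for r in rows if r[2] is not None)   # valid_to is set
n_current = sum(1 for r in rows if r[2] is None)      # valid_to is NULL
print(len(rows), n_former, n_current)  # prints: 100 50 50
```

Note that each half can only reach 50 rows if the corresponding group actually has at least 50 users; with fewer, the LIMIT simply returns what is there and the sample ends up unbalanced.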

Answered 2012-10-02T16:52:24.060

I ended up doing the following:

(SELECT t1.user_property1, t2.user_property2, t3.valid_to
 FROM additional_info t1
 LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
 INNER JOIN former_users t3 ON t1.user_ID = t3.user_ID
 ORDER BY random() LIMIT 50)
UNION
(SELECT t1.user_property1, t2.user_property2, NULL
 FROM additional_info t1
 LEFT JOIN all_users t2 ON t1.user_ID = t2.user_ID
 WHERE t1.user_ID NOT IN (SELECT user_ID FROM former_users)
 ORDER BY random() LIMIT 50)

But I am still looking for a better solution.
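One small robustness point about the NOT IN filter in the query above, in case it helps: if the subquery ever returns a NULL user_ID, NOT IN matches no rows at all, so the "current users" half would come back empty. NOT EXISTS does not have this problem. The sketch below shows the difference on a tiny made-up data set, using Python's built-in sqlite3 module as a stand-in for the real database. The NULL row in former_users is a hypothetical case of dirty data, not something the question says exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE additional_info (user_ID INTEGER, user_property1 TEXT);
    CREATE TABLE former_users (user_ID INTEGER, valid_to TEXT);
    INSERT INTO additional_info VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- One real former user, plus a hypothetical row with a NULL user_ID.
    INSERT INTO former_users VALUES (1, '2012-01-01'), (NULL, '2012-06-30');
""")

# NOT IN: the comparison against NULL is unknown, so no row passes the filter.
not_in = conn.execute("""
    SELECT t1.user_ID FROM additional_info t1
    WHERE t1.user_ID NOT IN (SELECT user_ID FROM former_users)
    ORDER BY t1.user_ID
""").fetchall()

# NOT EXISTS: only checks whether a matching row exists, so NULLs are harmless.
not_exists = conn.execute("""
    SELECT t1.user_ID FROM additional_info t1
    WHERE NOT EXISTS (
        SELECT 1 FROM former_users f WHERE f.user_ID = t1.user_ID
    )
    ORDER BY t1.user_ID
""").fetchall()

print(not_in)      # prints: []
print(not_exists)  # prints: [(2,), (3,)]
```

If former_users.user_ID is declared NOT NULL (for example as a primary key), both forms return the same rows and the choice is mostly a matter of taste.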

Answered 2012-10-02T19:52:29.857