python - Draw a random subsample from a df by category

Question

I have a dataframe like this:

import pandas as pd

names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology", "Internal medicine, General Med, Endocrinology", "Pediatrics, Medical genetics, Laboratory medicine", "Internal medicine", "Endocrinology", "Pediatrics", "General Med, Laboratory medicine"]

# Pair each name with its comma-separated category string
zippedList = list(zip(names, categories))
df = pd.DataFrame(zippedList, columns=['names', 'categories'])

which produces:

print(df)
       names                                         categories
0  Patient 1                Internal medicine, Gastroenterology
1  Patient 2      Internal medicine, General Med, Endocrinology
2  Patient 3  Pediatrics, Medical genetics, Laboratory medicine
3  Patient 4                                  Internal medicine
4  Patient 5                                      Endocrinology
5  Patient 6                                         Pediatrics
6  Patient 7                   General Med, Laboratory medicine

(The real dataframe has >1000 rows.)

Counting the categories yields:

print(df['categories'].str.split(", ").explode().value_counts())

Internal medicine      3
General Med            2
Endocrinology          2
Laboratory medicine    2
Pediatrics             2
Gastroenterology       1
Medical genetics       1

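I want to draw a random subsample of n rows in which each medical category is represented proportionally. For example, 3 of the 13 category entries (~23%) are "Internal medicine", so about 23% of the category entries in the subsample should be "Internal medicine" as well. This would not be too hard if every patient had exactly 1 category, but unfortunately they can have several (e.g. Patient 3 even has 3 categories). How can I do this?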

score 0 · Accepted Answer

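The fact that your patients have several categories does not affect the subsampling process. When you draw n rows out of nrows (i.e. len(df)), the subsample preserves the category weights, give or take the chance that a category ends up over- or under-represented in your random subset. That deviation converges to 0 as n grows.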

Typically,

# n must not exceed len(df) unless you pass replace=True
n = 2000
df2 = df.sample(n).copy(deep = True)
print(df2['categories'].str.split(", ").explode().value_counts())

should work the way you want.
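To see how closely a plain random sample tracks the category weights, the shares in the full frame can be compared with the shares in subsamples of different sizes. The following is only a sketch on synthetic data: the frame `demo`, the helper `category_shares`, the weights, and the sample sizes are all assumptions made for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

def category_shares(frame, column="categories", sep=", "):
    """Share of each category among all category entries in `frame`."""
    return frame[column].str.split(sep).explode().value_counts(normalize=True)

# Hypothetical synthetic data: 5000 patients, each with 1 to 3 distinct categories.
rng = np.random.default_rng(0)
pool = ["Internal medicine", "General Med", "Endocrinology", "Laboratory medicine",
        "Pediatrics", "Gastroenterology", "Medical genetics"]
weights = np.array([3, 2, 2, 2, 2, 1, 1]) / 13
rows = [
    ", ".join(rng.choice(pool, size=rng.integers(1, 4), replace=False, p=weights))
    for _ in range(5000)
]
demo = pd.DataFrame({
    "names": [f"Patient {i + 1}" for i in range(5000)],
    "categories": rows,
})

full = category_shares(demo)
max_gap = {}
for n in (50, 500, 4000):
    sub = demo.sample(n, random_state=0)
    # Categories absent from the subsample count as a share of 0.
    gap = (category_shares(sub).reindex(full.index, fill_value=0) - full).abs()
    max_gap[n] = gap.max()
    print(n, round(max_gap[n], 4))
```

The printed value is the largest absolute difference between a category's share in the subsample and its share in the full frame. It would be expected to shrink as n approaches len(demo).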

I also read that you have about 1000 categories. Don't forget to preprocess them before using them, because some of them may disappear from your subsample.
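One way to check for this, assuming "disappear" means that a rare category has no row left in the subsample, is to compare the category sets before and after sampling. The helper `missing_categories` is a hypothetical name, and the subset below is picked by position only to keep the illustration deterministic; in practice it would come from `df.sample(n)`.

```python
import pandas as pd

def missing_categories(full, sub, column="categories", sep=", "):
    """Categories that occur in `full` but in no row of `sub`."""
    all_cats = set(full[column].str.split(sep).explode())
    sub_cats = set(sub[column].str.split(sep).explode())
    return sorted(all_cats - sub_cats)

df = pd.DataFrame({
    "names": [f"Patient {i}" for i in range(1, 8)],
    "categories": [
        "Internal medicine, Gastroenterology",
        "Internal medicine, General Med, Endocrinology",
        "Pediatrics, Medical genetics, Laboratory medicine",
        "Internal medicine",
        "Endocrinology",
        "Pediatrics",
        "General Med, Laboratory medicine",
    ],
})

# Deterministic stand-in for df.sample(3): Patients 4, 5 and 6.
sub = df.iloc[[3, 4, 5]]
lost = missing_categories(df, sub)
print(lost)
# ['Gastroenterology', 'General Med', 'Laboratory medicine', 'Medical genetics']
```

If the list is non-empty, any downstream step that expects every category to be present (for example, one column per category) needs to account for the gaps.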
