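I have almost 100,000 data points whose target is either "disease" or "no disease".
However, my data is imbalanced: 97% of it is "no disease" and 3% is "disease".
To overcome this, I manually created extra disease data by making copies of the actual disease rows and merging them with the original data, using this code: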
import pandas as pd

# Select the rows where disease is 1 (the minority class).
ia = df[df['disease'] == 1]

# Make six copies of those rows, giving each copy a unique 'patient ID'
# by appending a dummy letter as a suffix to the original ID.
dup = pd.DataFrame()
for j in ['B', 'C', 'E', 'F', 'G', 'H']:
    i = ia.copy()
    i['dum'] = j
    i["patient ID"] = i["Employee Code"] + i['dum']
    dup = pd.concat([dup, i])

# Add the copies to the original data.
df = pd.concat([dup, df])
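
To see what this duplication does to the class balance, here is a minimal sketch on a toy 100-row frame. The toy data and the names `minority`, `suffixes`, and `oversampled` are assumptions made for illustration and are not my real data. Only the `disease` and `Employee Code` columns and the six suffix letters come from the code above.

```python
import pandas as pd

# Toy stand-in (assumption): 97 "no disease" rows and 3 "disease" rows.
df = pd.DataFrame({
    "Employee Code": [f"P{n:03d}" for n in range(100)],
    "disease": [0] * 97 + [1] * 3,
})

minority = df[df["disease"] == 1]
suffixes = ["B", "C", "E", "F", "G", "H"]

# One suffixed copy of the minority rows per letter, as in the loop above.
copies = [
    minority.assign(**{"patient ID": minority["Employee Code"] + s})
    for s in suffixes
]
oversampled = pd.concat(copies + [df], ignore_index=True)

# Class counts and the share of disease rows after duplication.
counts = oversampled["disease"].value_counts()
disease_share = counts[1] / counts.sum()
print(counts.to_dict(), round(disease_share, 3))
```

With six copies plus the originals, each disease row appears seven times, so the disease share rises from 3% to 21/118, or about 18%. The data is still not balanced, and every added row is an exact duplicate apart from its ID.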
Please let me know whether this is a correct way to do oversampling.