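I have a dataset with both categorical and continuous features, many of which have missing values. I would like to know whether a single imputer can fill in both the continuous and the categorical data.
If that is not possible, what is the best approach? Would it be better to split the data into continuous and discrete features, use, for example, IterativeImputer for the first group and KNN for the second, and then merge them?
Any help would be greatly appreciated.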
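To make the split-and-merge idea concrete, here is a rough sketch of what I have in mind, not a tested solution for my real data. The toy frame and its values are made up; only the column names are borrowed from my dataset, and the lists `continuous_cols` and `categorical_cols` are hypothetical names I chose for illustration. It assumes scikit-learn's `ColumnTransformer`, `IterativeImputer`, and `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical toy data: made-up values, column names borrowed from the real dataset.
toy = pd.DataFrame({
    "age":     [-1.1, -0.2, np.nan, 0.6, 1.1, 0.1],
    "chol":    [0.8, np.nan, 2.9, 0.5, 1.0, -0.2],
    "painloc": [0.0, 1.0, 1.0, np.nan, 1.0, 0.0],
    "cp":      [1.0, 2.0, np.nan, 4.0, 4.0, 1.0],
})
continuous_cols = ["age", "chol"]
categorical_cols = ["painloc", "cp"]

# Each imputer only sees its own group of columns.
imputer = ColumnTransformer([
    ("cont", IterativeImputer(random_state=0), continuous_cols),
    # n_neighbors=1 copies the value of the single nearest donor row,
    # so the filled-in value is always a category code that already exists.
    ("cat", KNNImputer(n_neighbors=1), categorical_cols),
])

# The output columns follow the transformer order: continuous first, then categorical.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=continuous_cols + categorical_cols,
    index=toy.index,
)
print(imputed)
```

With more than one neighbor, KNNImputer averages the donors and can produce fractional codes again, which is why the sketch uses `n_neighbors=1`. The distance is also computed on the raw category codes as if they were numbers, which is only an approximation for nominal features.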
The data contains 65 features:
x_train
age sex painloc painexer relrest cp trestbps htn chol smoke ... om1 om2 rcaprox rcadist lvx1 lvx2 lvx3 lvx4 lvf cathef
288 -1.109572 1.0 0.0 0.0 0.0 1.0 -0.655059 0.0 0.818661 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 2.0 0.568676
283 -0.180525 1.0 1.0 0.0 0.0 2.0 1.447445 0.0 -0.040919 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
230 -0.077297 1.0 1.0 1.0 0.0 3.0 0.659006 1.0 2.872604 NaN ... 2.0 NaN 2.0 NaN 1.0 1.0 1.0 1.0 1.0 NaN
380 -0.799890 0.0 1.0 1.0 1.0 4.0 -0.129433 0.0 0.339106 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
147 0.129157 1.0 1.0 1.0 1.0 4.0 NaN 0.0 0.031467 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 -0.822164
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
847 -0.180525 0.0 NaN NaN NaN 3.0 0.185942 1.0 -0.040919 NaN ... 1.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
301 -0.283752 1.0 1.0 1.0 1.0 4.0 -0.129433 0.0 -0.194738 NaN ... NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 NaN
693 0.645295 1.0 NaN NaN NaN 4.0 -0.392246 1.0 0.520070 NaN ... 1.0 NaN 2.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
115 1.058204 1.0 1.0 1.0 1.0 4.0 NaN 0.0 0.954384 0.0 ... 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 -0.811925
155 1.574341 1.0 1.0 1.0 1.0 4.0 NaN 1.0 NaN 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
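I have already standardized the continuous variables. Many categorical features, such as 'painloc' and 'painexer', have missing values, and some continuous features, such as 'age' (which I decided to treat as continuous) and 'chol', also have missing values.
I tried using IterativeImputer: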
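import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer

x_mice = x_train.copy()  # work on a copy so x_train is left untouched
mice_impute = IterativeImputer(sample_posterior=True)
# fit_transform returns a NumPy array, so rebuild the DataFrame using the column names in `labels`
x_mice = pd.DataFrame(mice_impute.fit_transform(x_mice), columns=labels)
x_mice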
age sex painloc painexer relrest cp trestbps htn chol smoke ... om1 om2 rcaprox rcadist lvx1 lvx2 lvx3 lvx4 lvf cathef
0 1.049449 1.0 1.000000 1.000000 1.000000 4.0 0.444874 0.000000 0.540723 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.891887
1 0.505617 1.0 1.000000 1.000000 0.000000 2.0 -0.266785 0.000000 -1.752150 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.888760
2 0.831916 1.0 1.000000 0.000000 0.000000 4.0 -1.080109 0.764037 -1.752150 1.450166 ... 1.000000 1.000000 1.000000 1.000000 1.025761 0.879404 -0.400332 3.193691 3.267492 1.118696
3 -0.582047 1.0 1.000000 0.000000 0.000000 2.0 -1.588436 0.000000 -0.249794 0.000000 ... 1.383778 1.048614 -0.147575 1.942328 1.000000 1.000000 1.000000 1.000000 1.000000 0.802084
4 -1.452178 1.0 1.000000 0.000000 0.000000 3.0 0.444874 1.000000 5.232542 1.000000 ... 1.235595 1.249215 2.269437 1.155985 1.000000 1.000000 1.000000 1.000000 1.000000 -1.935223
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
624 0.179318 1.0 1.000000 1.000000 1.000000 4.0 -0.571781 0.000000 0.628910 -0.060307 ... 0.928614 0.830982 1.080936 1.185430 1.000000 1.000000 1.000000 1.000000 1.000000 -1.032691
625 1.702047 1.0 1.000000 0.000000 1.000000 3.0 0.444874 0.000000 -1.752150 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 -0.895014
626 -0.364514 1.0 0.694690 1.738101 0.396025 4.0 0.953201 1.000000 0.390804 1.287500 ... 1.000000 0.739708 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 -0.523902
627 0.723149 1.0 0.762459 0.038032 0.315826 4.0 0.444874 1.000000 0.831741 0.750375 ... 1.000000 0.912221 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 0.730936
628 0.940682 1.0 1.000000 1.000000 1.000000 4.0 -0.000217 0.000000 -0.252964 0.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -0.888134
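It works for the continuous features but not for the categorical ones, because it fills them in with decimal values, which is clearly wrong.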
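One crude workaround I can think of is to snap each imputed categorical value back to the nearest category code that actually occurs in the original column. The sketch below is only an illustration under that assumption: `snap_to_observed_levels` is a hypothetical helper of my own, not a scikit-learn function, and the small frames are made-up values resembling the output above.

```python
import numpy as np
import pandas as pd

def snap_to_observed_levels(imputed, original, categorical_cols):
    """Replace each value in the categorical columns with the nearest level observed in `original`."""
    snapped = imputed.copy()
    for col in categorical_cols:
        # Category codes that really occur in the un-imputed data.
        levels = np.sort(original[col].dropna().unique())
        values = imputed[col].to_numpy(dtype=float)
        # For every value, pick the index of the closest observed level.
        nearest = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
        snapped[col] = levels[nearest]
    return snapped

# Made-up example: NaN marks the originally missing entries.
original = pd.DataFrame({
    "painloc": [1.0, np.nan, 0.0, np.nan],
    "cp":      [4.0, 2.0, np.nan, 1.0],
})
imputed = pd.DataFrame({
    "painloc": [1.0, 0.694690, 0.0, 1.738101],
    "cp":      [4.0, 2.0, 3.193691, 1.0],
})

snapped = snap_to_observed_levels(imputed, original, ["painloc", "cp"])
print(snapped)
```

This treats the category codes as ordered numbers, so it is questionable for nominal features such as 'cp', where code 3 is not really "between" 2 and 4. It removes the decimals but does not make the underlying imputation model any more appropriate for categorical data.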