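I have a dataset with both categorical and continuous features, and many of them have missing values. I would like to know whether a single imputer can fill in both the continuous and the categorical data appropriately.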

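If that is not possible, what is the best approach? Would it be better to split the data into continuous and discrete features, use, for example, IterativeImputer on the first group and KNN on the second, and then merge them back together?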
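To make the split-and-merge idea concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn's ColumnTransformer is an acceptable way to route columns. The toy frame and the column grouping are made up for illustration and are not my real data. KNNImputer is set to n_neighbors=1 so that, as far as I understand, each missing category is copied from the single nearest row instead of being averaged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy stand-in for x_train (hypothetical values, 4 of the 65 features).
toy = pd.DataFrame({
    "age":  [-1.1, -0.2, np.nan, 0.1, 1.6, 0.6],
    "chol": [0.8, np.nan, 2.9, 0.0, -0.2, 0.5],
    "sex":  [1.0, 1.0, 0.0, 1.0, np.nan, 0.0],
    "cp":   [1.0, 2.0, np.nan, 4.0, 4.0, 3.0],
})
continuous_cols = ["age", "chol"]   # assumed grouping
categorical_cols = ["sex", "cp"]    # assumed grouping

# Each imputer only sees its own group of columns.
imputer = ColumnTransformer([
    ("cont", IterativeImputer(random_state=0), continuous_cols),
    # One neighbour: the donor's observed code is reused as-is.
    ("cat", KNNImputer(n_neighbors=1), categorical_cols),
])

# ColumnTransformer returns the columns in transformer order,
# so the merged frame is rebuilt with that order.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=continuous_cols + categorical_cols,
    index=toy.index,
)
print(imputed)
```

One limitation of this sketch is that each imputer ignores the other group, so the categorical imputation cannot use information from the continuous features and vice versa.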

Any help would be appreciated.

The data contains 65 features:

x_train

        age         sex painloc painexer relrest    cp   trestbps      htn     chol      smoke      ...     om1     om2 rcaprox rcadist     lvx1    lvx2    lvx3    lvx4    lvf     cathef
288     -1.109572   1.0     0.0     0.0     0.0     1.0     -0.655059   0.0     0.818661    NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     2.0     0.568676
283     -0.180525   1.0     1.0     0.0     0.0     2.0     1.447445    0.0     -0.040919   NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
230     -0.077297   1.0     1.0     1.0     0.0     3.0     0.659006    1.0     2.872604    NaN     ...     2.0     NaN     2.0     NaN     1.0     1.0     1.0     1.0     1.0     NaN
380     -0.799890   0.0     1.0     1.0     1.0     4.0     -0.129433   0.0     0.339106    NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
147     0.129157    1.0     1.0     1.0     1.0     4.0     NaN     0.0     0.031467    0.0     ...     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     -0.822164
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
847     -0.180525   0.0     NaN     NaN     NaN     3.0     0.185942    1.0     -0.040919   NaN     ...     1.0     NaN     1.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN
301     -0.283752   1.0     1.0     1.0     1.0     4.0     -0.129433   0.0     -0.194738   NaN     ...     NaN     NaN     NaN     NaN     1.0     1.0     1.0     1.0     1.0     NaN
693     0.645295    1.0     NaN     NaN     NaN     4.0     -0.392246   1.0     0.520070    NaN     ...     1.0     NaN     2.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN
115     1.058204    1.0     1.0     1.0     1.0     4.0     NaN     0.0     0.954384    0.0     ...     1.0     1.0     2.0     1.0     1.0     1.0     1.0     1.0     1.0     -0.811925
155     1.574341    1.0     1.0     1.0     1.0     4.0     NaN     1.0     NaN     0.0     ...     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0     NaN

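I have already standardized the continuous variables. Many categorical features, such as 'painloc' and 'painexer', have missing values, and some continuous features, such as 'age' (which I decided to treat as continuous) and 'chol', have missing values as well.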

I tried using IterativeImputer:

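import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer

x_mice = x_train.copy()  # work on a copy so x_train is left untouched
mice_impute = IterativeImputer(sample_posterior=True)
x_mice = pd.DataFrame(mice_impute.fit_transform(x_mice), columns=labels)
x_mice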

     age    sex     painloc     painexer    relrest     cp  trestbps    htn     chol    smoke   ...     om1     om2     rcaprox     rcadist     lvx1    lvx2    lvx3    lvx4    lvf     cathef
0   1.049449    1.0     1.000000    1.000000    1.000000    4.0     0.444874    0.000000    0.540723    0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.891887
1   0.505617    1.0     1.000000    1.000000    0.000000    2.0     -0.266785   0.000000    -1.752150   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.888760
2   0.831916    1.0     1.000000    0.000000    0.000000    4.0     -1.080109   0.764037    -1.752150   1.450166    ...     1.000000    1.000000    1.000000    1.000000    1.025761    0.879404    -0.400332   3.193691    3.267492    1.118696
3   -0.582047   1.0     1.000000    0.000000    0.000000    2.0     -1.588436   0.000000    -0.249794   0.000000    ...     1.383778    1.048614    -0.147575   1.942328    1.000000    1.000000    1.000000    1.000000    1.000000    0.802084
4   -1.452178   1.0     1.000000    0.000000    0.000000    3.0     0.444874    1.000000    5.232542    1.000000    ...     1.235595    1.249215    2.269437    1.155985    1.000000    1.000000    1.000000    1.000000    1.000000    -1.935223
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
624     0.179318    1.0     1.000000    1.000000    1.000000    4.0     -0.571781   0.000000    0.628910    -0.060307   ...     0.928614    0.830982    1.080936    1.185430    1.000000    1.000000    1.000000    1.000000    1.000000    -1.032691
625     1.702047    1.0     1.000000    0.000000    1.000000    3.0     0.444874    0.000000    -1.752150   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    -0.895014
626     -0.364514   1.0     0.694690    1.738101    0.396025    4.0     0.953201    1.000000    0.390804    1.287500    ...     1.000000    0.739708    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    -0.523902
627     0.723149    1.0     0.762459    0.038032    0.315826    4.0     0.444874    1.000000    0.831741    0.750375    ...     1.000000    0.912221    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    2.000000    0.730936
628     0.940682    1.0     1.000000    1.000000    1.000000    4.0     -0.000217   0.000000    -0.252964   0.000000    ...     1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    -0.888134

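This works for the continuous features but not for the categorical ones, because it fills them in with decimal values (for example 0.694690 for 'painloc'), which is clearly wrong.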
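One workaround I have considered is to post-process the imputed frame and snap each categorical value to the nearest category code that was actually observed. The sketch below assumes the imputed and original frames share column names; snap_to_observed is a hypothetical helper of my own, not a scikit-learn function, and the three fractional values are taken from one row of the output above. I am not sure this is statistically sound, since it treats the category codes as ordered numbers.

```python
import numpy as np
import pandas as pd

def snap_to_observed(imputed, original, categorical_cols):
    """Replace each value in the categorical columns with the nearest
    code that was observed (non-missing) in the original data."""
    out = imputed.copy()
    for col in categorical_cols:
        # Category codes that really occur in the un-imputed column.
        levels = np.sort(original[col].dropna().unique())
        values = out[col].to_numpy(dtype=float)
        # Index of the closest observed level for every row.
        nearest = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
        out[col] = levels[nearest]
    return out

# Small illustration: the last row mimics the fractional imputations.
original = pd.DataFrame({
    "painloc":  [0.0, 1.0, np.nan],
    "painexer": [0.0, 1.0, np.nan],
    "relrest":  [0.0, 1.0, np.nan],
})
imputed = pd.DataFrame({
    "painloc":  [0.0, 1.0, 0.694690],
    "painexer": [0.0, 1.0, 1.738101],
    "relrest":  [0.0, 1.0, 0.396025],
})
snapped = snap_to_observed(imputed, original, list(original.columns))
print(snapped)
```

In this toy case the last row becomes 1.0, 1.0 and 0.0, so every categorical value is a valid code again, but the choice is driven purely by numeric distance between codes.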
