
I have a dataset with many features (mostly binary categorical features, Yes/No) and many missing values.

One technique for dimensionality reduction is to generate a large, carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we can generate a large set of very shallow trees, with each tree trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain.
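A minimal sketch of this idea, assuming scikit-learn; the random data and the `max_features` approximation (restricting the candidate features per split rather than per tree) are my own illustrative choices, not from the original post:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 40)).astype(float)  # 40 binary (Yes/No) features
y = (X[:, 3] + X[:, 17] > 1).astype(int)              # target depends on a few of them

# Many very shallow trees, each choosing splits from a small random
# subset of the attributes.
forest = RandomForestClassifier(
    n_estimators=500,   # a large set of trees
    max_depth=2,        # very shallow
    max_features=5,     # small fraction of the attributes considered per split
    random_state=0,
).fit(X, y)

# Attributes that are repeatedly selected as good splits accumulate
# high importance scores; keep the top-ranked ones.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Most informative features:", ranking[:5])
```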

I am also using an imputer to fill in the missing values.
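For concreteness, a sketch of such an imputation step, assuming scikit-learn's `SimpleImputer`; the `most_frequent` strategy suits Yes/No categoricals, and the tiny array is a made-up example:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Object array of Yes/No categoricals with np.nan marking missing entries.
X = np.array([["Yes", "No"],
              [np.nan, "No"],
              ["Yes", np.nan]], dtype=object)

# Replace each missing entry with its column's most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X))
```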

My question is about the order of these two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?


1 Answer


From a mathematical point of view, you should always avoid data imputation (in the sense that you should use it only when necessary). In other words, if you have a method that can handle missing values directly, use it; only if you don't are you left with imputation.

Data imputation is almost always significantly biased; this has been shown many times, and I believe I have even read papers on it that are about 20 years old. In general, to perform statistically sound imputation you need to fit a very good generative model. Simply imputing the "most frequent" value, the mean, etc., makes assumptions about the data of a similar strength to Naive Bayes.
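A minimal sketch of the "use a method that handles missing values" advice, assuming scikit-learn, whose `HistGradientBoostingClassifier` accepts `NaN` entries natively so no imputation step is needed; the synthetic data is illustrative only:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% missing values

# Trains directly on data containing NaNs -- no imputer in the pipeline.
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print("Training accuracy with NaNs, no imputation:", clf.score(X, y))
```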

answered 2016-06-01T22:26:28.297