
I am new to machine learning and am working on a classification problem with categorical (nominal) data. I have tried applying BayesNet and a couple of tree-based and rule-based classification algorithms to the raw data, and I am able to achieve an AUC of 0.85.

I want to improve the AUC further by pre-processing or transforming the data. However, since the data is categorical, I don't think that a log transform or arithmetic combinations of columns (addition, multiplication, etc.) will work here.

Can somebody list the most common transformations applied to categorical data sets? (I tried one-hot encoding, but it takes a lot of memory!)


1 Answer


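As you mentioned, in my experience categorical data is best handled with one-hot encoding (i.e. converting each value to a binary vector). If memory is a problem, it may be worth using an online classification algorithm and generating the encoded vectors on the fly.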
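A minimal sketch of the on-the-fly idea, using only the Python standard library. Instead of materialising the full one-hot matrix, each row is turned into its handful of active (column, value) features at the moment it is needed, and an online learner keeps one weight per feature it has actually seen. The class name `OnlineLogisticRegression`, the helper `active_features`, the learning rate, and the toy data are all assumptions made for illustration, not something from the question.

```python
import math
from collections import defaultdict


def active_features(row):
    """Sparse one-hot view of a row: one (column index, value) key per column."""
    return [(col, val) for col, val in enumerate(row)]


class OnlineLogisticRegression:
    """Hypothetical online learner with one weight per observed (column, value)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # grows only with features actually seen
        self.bias = 0.0

    def predict_proba(self, row):
        score = self.bias + sum(self.weights[key] for key in active_features(row))
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, row, label):
        # One stochastic gradient step on the log-loss for a single example.
        error = label - self.predict_proba(row)
        self.bias += self.learning_rate * error
        for key in active_features(row):
            self.weights[key] += self.learning_rate * error


# Toy data (made up): the first column determines the label.
rows = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
labels = [1, 1, 0, 0]

model = OnlineLogisticRegression()
for _ in range(50):  # several passes over the stream
    for row, label in zip(rows, labels):
        model.update(row, label)

print(model.predict_proba(("red", "small")) > 0.5)
print(model.predict_proba(("blue", "large")) < 0.5)
```

Memory use here scales with the number of distinct (column, value) pairs seen, not with rows times categories, which is the point of encoding on the fly.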

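Beyond that, if the categories represent ranges (for example, brackets of values such as age, height, or income), you can replace each category with the center of its range (or some appropriate average, if you know the distribution within the bracket) and treat it as a real number.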
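The range-to-midpoint conversion can be sketched as below. The bracket labels and bounds in `AGE_BRACKETS` and the helper name `bracket_midpoint` are hypothetical; real bounds would have to come from how the data was binned.

```python
# Hypothetical age brackets: label -> (lower bound, upper bound).
AGE_BRACKETS = {
    "18-24": (18, 24),
    "25-34": (25, 34),
    "35-44": (35, 44),
}


def bracket_midpoint(label, brackets=AGE_BRACKETS):
    """Replace a range category with the center of its range."""
    low, high = brackets[label]
    return (low + high) / 2.0


ages = ["25-34", "18-24", "35-44"]
numeric_ages = [bracket_midpoint(a) for a in ages]
print(numeric_ages)  # [29.5, 21.0, 39.5]
```

If the distribution within a bracket is known, the midpoint could be swapped for the bracket's mean or median.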

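If you are applying clustering, you can also treat the category labels as points on an axis (1, 2, 3, 4, 5, etc.), scaled appropriately relative to the other features.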
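A small sketch of placing labels on an axis and scaling them, assuming the labels have a meaningful order. The helper names `ordinal_codes` and `min_max_scale` and the example levels are made up for illustration; for purely nominal labels any such ordering is arbitrary and the distances it implies may mislead a clustering algorithm.

```python
def ordinal_codes(values, order):
    """Map each label to its position on the axis defined by `order`."""
    position = {label: index for index, label in enumerate(order)}
    return [position[v] for v in values]


def min_max_scale(codes, num_levels):
    """Rescale integer codes to [0, 1] so they are comparable to other scaled features."""
    return [c / (num_levels - 1) for c in codes]


order = ["low", "medium", "high"]      # assumed ordering of the levels
values = ["medium", "low", "high"]

codes = ordinal_codes(values, order)
scaled = min_max_scale(codes, len(order))
print(codes)   # [1, 0, 2]
print(scaled)  # [0.5, 0.0, 1.0]
```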

Answered 2013-07-30T14:20:56.170