What is the correct way to normalize feature vectors for use with a linear-kernel SVM?
Looking at LIBSVM, it appears this is done simply by rescaling each feature to a standard upper/lower bound. However, PyML doesn't seem to provide a way to scale the data this way. Instead, it offers options to normalize each vector by its length, to shift each feature value by its mean while rescaling by the standard deviation, and so on.
I am dealing with a case where most of the features are binary, except for a few that are numeric.
I am not an expert in this, but I believe centering and scaling each feature by subtracting its mean and then dividing by its standard deviation is a typical way to normalize feature vectors for use with SVMs. In R, this can be done with the scale function.
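A minimal sketch of that centering-and-scaling step in Python with NumPy (the toy matrix is made up for illustration; this mirrors what R's scale does with its defaults):

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Center each feature (column) to zero mean, then rescale to unit
# variance -- analogous to scale(X) in R with default arguments.
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # sample standard deviation, as R uses
X_scaled = (X - mu) / sigma
```

Note that the statistics are computed per column (per feature), not per row, and that the same mu and sigma from the training data should be reused when scaling test data.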
Another way is to transform each feature vector to the [0,1] range:
(x - min(x)) / (max(x) - min(x))
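The same [0,1] rescaling, applied per feature in NumPy (again with a made-up toy matrix):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescale each feature (column) to [0, 1] using its own min and max:
# (x - min(x)) / (max(x) - min(x))
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_01 = (X - X_min) / (X_max - X_min)
```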
Some features might also benefit from a log-transformation if their distribution is very skewed, but keep in mind that this changes the shape of the distribution as well, rather than just shifting it.
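For instance, with a strictly non-negative, heavily skewed feature (the values here are invented), log1p compresses the large values while remaining defined at zero:

```python
import numpy as np

# Hypothetical skewed feature: mostly small values plus one large outlier.
x = np.array([1.0, 2.0, 3.0, 1000.0])

# log1p(x) = log(1 + x); the outlier is pulled in from 1000 to about 6.9.
x_log = np.log1p(x)
```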
I am not sure what you gain in an SVM setting by normalizing the vectors by their L1 or L2 norm, as PyML does with its normalize method. I would guess binary features (0 or 1) don't need to be normalized.
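For clarity, normalizing by vector length operates on each sample (row) rather than each feature (column). A sketch of per-sample L2 normalization, which I assume is what PyML's normalize method does:

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Divide each sample (row) by its own Euclidean length, so every
# row ends up with unit L2 norm.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
```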