machine-learning - Is it right to normalize data and/or weight vectors in a SOM?

Question

I am stumped by something that should be simple:

I have written a SOM for a simple 'play' two-dimensional data set. Here is the data:

[Figure: the original two-dimensional data set]

You can make out 3 clusters by yourself.

Now, there are two things that confuse me. The first is that the tutorial I am following normalizes the data before the SOM gets to work on it. That is, it normalizes each data vector to have length 1 (Euclidean norm). If I do that, then the data looks like this:

[Figure: the data after normalizing each vector to unit length]

(This is because all the data has been projected onto the unit circle).

So, my question(s) are as follows:

1) Is this correct? Projecting the data down onto the unit circle seems to be bad, because you can no longer make out 3 clusters... Is this a fact of life for SOMs? (i.e., that they only work on the unit circle).

2) The second, related question is that not only are the data normalized to have length 1, but so are the weight vectors of each output unit after every iteration. I understand that this is done so that the weight vectors don't 'blow up', but it seems wrong to me, since the whole point of the weight vectors is to retain distance information. If you normalize them, you lose the ability to 'cluster' properly. For example, how can the SOM possibly distinguish the cluster on the lower left from the cluster on the upper right, since they project down to the unit circle the same way?

I am very confused by this. Should data be normalized to unit length in SOMs? Should the weight vectors be normalized as well?

Thanks!

EDIT

Here is the data, saved as a .mat file for MATLAB. It is a simple two-dimensional data set.

score 10 · Accepted Answer

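Whether you should normalize the input data depends on what the data represent. Say you are clustering two-dimensional (or three-dimensional) input data where each data vector represents a point in space. The first dimension is the x coordinate and the second is the y coordinate. In that case you do not normalize the input data, because the input features (the dimensions) are directly comparable to each other.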

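If you are again clustering in a two-dimensional space, but each input vector represents a person's age and annual income (the first feature being the age and the second the annual income), then you must normalize the input features, because they represent different things (different units of measurement) on completely different scales. Let's examine these input vectors: D1(25, 30000), D2(50, 30000) and D3(25, 60000). Compared to D1, both D2 and D3 double one of the features. Keep in mind that a SOM uses a Euclidean distance measure. Distance(D1, D2) = 25 and Distance(D1, D3) = 30000. That is rather "unfair" to the first input feature (age): although you double it, you get a much smaller distance than in the second case (D1, D3).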
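To make the arithmetic above concrete, here is a small sketch that computes both distances on the raw vectors and again after scaling each column to [0, 1]. The helper names (`euclidean`, `min_max_columns`) are made up for this illustration, and per-column min-max scaling is just one possible choice of normalization.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    """Scale every column (feature) of `rows` to [0, 1] independently.

    Assumes no column is constant (otherwise max - min would be zero).
    """
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
            for row in rows]

# (age, annual income) vectors from the example above
d1, d2, d3 = (25, 30000), (50, 30000), (25, 60000)

raw_12 = euclidean(d1, d2)   # 25.0    (age doubled)
raw_13 = euclidean(d1, d3)   # 30000.0 (income doubled)

n1, n2, n3 = min_max_columns([d1, d2, d3])
scaled_12 = euclidean(n1, n2)  # 1.0
scaled_13 = euclidean(n1, n3)  # 1.0
```

After scaling, doubling the age and doubling the income move a point by the same distance, so neither feature dominates the Euclidean metric.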

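Check this; it has a similar example as well.

If you are going to normalize your input data, normalize each feature/dimension separately (each column of your input data table). Quoting the som_normalize manual:

"Normalization is always a univariate operation"

Also check this for a brief explanation of normalization, and if you want to read more, try this (chapter 7 is the one you want).

EDIT:

The most common approaches to normalization are scaling each dimension of the data to [0,1], or transforming it to have zero mean and standard deviation 1. The first is done by subtracting from each input the minimum value of its dimension (column) and dividing by the maximum value minus the minimum value (of that dimension).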

Xi,norm = (Xi - Xmin)/(Xmax-Xmin)

Yi,norm = (Yi - Ymin)/(Ymax-Ymin)

In the second approach, you subtract the mean of each dimension and then divide by its standard deviation.

Xi,norm = (Xi - Xmean)/(Xsd)

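Each approach has its pros and cons. For example, the first approach is very sensitive to outliers in the data. You should choose after examining the statistical properties of your data set.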
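The two approaches can be sketched for a single column as follows. This is only an illustration: the function names are invented here, and the z-score version uses the population standard deviation (`statistics.pstdev`), which is an assumption, since the formula above does not say which estimator to use.

```python
import statistics

def min_max(column):
    """Scale one column to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to scale, map everything to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Shift one column to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # population std; an assumption
    if sd == 0:
        return [0.0 for _ in column]
    return [(v - mean) / sd for v in column]

# One column with a single outlier (100.0).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

scaled = min_max(values)
standardized = z_score(values)
```

With the outlier present, min-max scaling squeezes the four ordinary values into roughly the bottom 3% of the [0, 1] range, which is the sensitivity to outliers mentioned above. Apply either function to each column of the data table separately.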

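Projecting onto the unit circle is not really a normalization method but more of a dimensionality reduction method, since after the projection you could replace each data point with a single number (e.g. its angle). You do not have to do this.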
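A short sketch of why the projection loses information: two points at very different distances from the origin, but in the same direction, end up at the same place on the unit circle and are fully described by one angle. The points and the helper name `to_unit_length` are invented for illustration.

```python
import math

def to_unit_length(p):
    """Divide a 2-D vector by its Euclidean norm."""
    norm = math.hypot(p[0], p[1])
    if norm == 0:
        raise ValueError("the zero vector has no direction")
    return (p[0] / norm, p[1] / norm)

near = (1.0, 1.0)   # e.g. a point from a lower-left cluster
far = (5.0, 5.0)    # e.g. a point from an upper-right cluster

u_near = to_unit_length(near)
u_far = to_unit_length(far)

# After projection each point is fully described by its angle.
angle_near = math.atan2(near[1], near[0])
angle_far = math.atan2(far[1], far[0])

same_point = all(math.isclose(a, b) for a, b in zip(u_near, u_far))
same_angle = math.isclose(angle_near, angle_far)
```

Both `same_point` and `same_angle` come out true here, so a SOM trained on the projected data cannot tell these two points apart.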

score 3 · Accepted Answer

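SOM training algorithms use a number of different measures to calculate the distance between vectors (patterns and weights). To name two (perhaps the most widely used): Euclidean distance and the dot product. If you normalize the input vectors and the weights to unit length, the two measures are equivalent (the closest weight vector is also the one with the largest dot product), and this lets the network learn in the most efficient way. If, for example, you do not normalize your current data, the network will treat points from different parts of the input space with a different bias (larger values will have a larger effect). This is why normalization to unit length is important and is considered an appropriate step in most cases (specifically, if the dot product is used as the measure).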
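The equivalence follows from the identity ||x - w||^2 = 2 - 2(x . w), which holds when both x and w have unit length. The sketch below picks the best-matching unit (BMU) both ways, first on raw vectors and then on unit-length ones. The input vector, the three weight vectors and the function names are all made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(v):
    """Scale a non-zero vector to unit Euclidean length."""
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def bmu_by_distance(x, weights):
    """Index of the weight vector closest to x."""
    return min(range(len(weights)), key=lambda i: euclidean(x, weights[i]))

def bmu_by_dot(x, weights):
    """Index of the weight vector with the largest dot product with x."""
    return max(range(len(weights)), key=lambda i: dot(x, weights[i]))

x = (1.0, 1.2)
weights = [(10.0, 0.0), (1.0, 1.0), (0.0, 3.0)]

# Raw vectors: the long weight (10, 0) wins the dot product purely by size.
raw_distance_bmu = bmu_by_distance(x, weights)   # 1
raw_dot_bmu = bmu_by_dot(x, weights)             # 0

# Unit-length vectors: both criteria agree.
ux = unit(x)
uweights = [unit(w) for w in weights]
unit_distance_bmu = bmu_by_distance(ux, uweights)  # 1
unit_dot_bmu = bmu_by_dot(ux, uweights)            # 1
```

Note that this only shows the two measures agree after normalization. It does not show that normalizing to unit length preserves the cluster structure of the original data.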

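Your source data should be prepared before it is normalized to the unit circle. You should map the data into the [-1, 1] region on both axes. Several algorithms exist for this; one of them uses these simple formulas: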

mult_factor = 2 / (max - min);
offset_factor = 1 - 2 * max / (max - min),

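where min and max are the minimum and maximum values in your data set, or the domain boundaries if they are known beforehand. Every dimension is processed separately. In your case these are the X and Y coordinates.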

Xnew[i] = Xold[i] * Xmult_factor + Xoffset_factor, i = 1..N
Ynew[i] = Yold[i] * Ymult_factor + Yoffset_factor, i = 1..N

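No matter what the actual values of min and max were before the mapping (in your case they could be [0, 1] or [-3.6, 10]), after the mapping the data will fall into the range [-1, 1]. In fact, the formulas above are specific to converting data into the range [-1, 1]; they are just a special case of the general procedure for converting from one range to another: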

data[i] = (data[i] - old_min) * (new_max - new_min) / (old_max - old_min) + new_min;
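As a quick check that the mult_factor/offset_factor form really is the [-1, 1] special case of the general formula, here is a sketch that implements both and compares them on one axis. The function names and the sample values are invented for illustration; min and max are taken from the data itself rather than from known domain boundaries.

```python
import math

def rescale(values, new_min, new_max):
    """General mapping from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are identical; the range has zero width")
    return [(v - old_min) * (new_max - new_min) / (old_max - old_min) + new_min
            for v in values]

def rescale_to_pm1(values):
    """Special case for [-1, 1], written with mult_factor and offset_factor."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range has zero width")
    mult_factor = 2 / (hi - lo)
    offset_factor = 1 - 2 * hi / (hi - lo)
    return [v * mult_factor + offset_factor for v in values]

# One axis of a hypothetical data set spanning [-3.6, 10].
xs = [-3.6, 0.0, 3.2, 10.0]

general = rescale(xs, -1.0, 1.0)
special = rescale_to_pm1(xs)

agree = all(math.isclose(g, s, abs_tol=1e-12) for g, s in zip(general, special))
```

Run each coordinate (X, then Y) through the function separately, as described above.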

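After the mapping you can proceed with normalization to the unit circle, so that you end up with a circle centered at [0, 0].

You can find more information on this page. Although the site is not about neural networks in general, this specific page gives a good explanation of SOMs, including descriptive charts about data normalization.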
