python - Predicting values with the k-Means clustering algorithm

Question

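I'm messing around with machine learning, and I've written a K-means implementation in Python. It takes two-dimensional data and organises it into clusters. Each data point also has a class value of either 0 or 1.

What confuses me about the algorithm is how I can then use it to predict values for another set of two-dimensional data where the class is unknown rather than 0 or 1. For each cluster, should I average the class values of the points in it to get a 0 or a 1, and then, if an unknown point is closest to that cluster, give that point the averaged value? Or is there a smarter method?

Cheers!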

score 16 · Accepted Answer

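To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.

In other words, it's the same step you used to iteratively assign each point in the original data set to one of the k clusters. The only difference here is that the centroids you use for this computation are the final set, i.e., the centroid values at the last iteration.

Here's an implementation in Python (w/ NumPy):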

>>> import numpy as NP
>>> # just made up values--based on your spec (2D data + 2 clusters)
>>> centroids = NP.array([[54, 85], [99, 78]])
>>> centroids
      array([[54, 85],
             [99, 78]])

>>> # a new (made-up) data point within the problem domain:
>>> new_data = NP.array([67, 78])

>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting across the centroid rows
>>> diff
      array([[-13,   7],
             [ 32,   0]])

>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1))  # Euclidean distance
>>> dist
      array([ 14.76,  32.  ])

>>> closest_centroid = centroids[NP.argmin(dist),]
>>> closest_centroid
       array([54, 85])

score 2 · Accepted Answer

I know I might be late, but here is my general solution to your problem:

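import numpy as np

def predict(data, centroids):
    """Return the index of the closest centroid for each row of data."""
    centroids, data = np.array(centroids), np.array(data)
    distances = []
    for unit in data:
        for center in centroids:
            # squared Euclidean distance; enough for finding the minimum
            distances.append(np.sum((unit - center) ** 2))
    # one row per data point, one column per centroid
    distances = np.reshape(distances, (len(data), len(centroids)))
    closest_centroid = [np.argmin(dist) for dist in distances]
    print(closest_centroid)
    return closest_centroid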

score 1 · Accepted Answer

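If you're thinking of assigning a value based on the average within the nearest cluster, you're talking about a form of "soft decoder", which estimates not only the correct value for the coordinate but also your level of confidence in that estimate. The alternative is a "hard decoder", where only the values 0 and 1 are legal (they are the ones that occur in the training data set), and the new coordinate gets the median of the values within the nearest cluster. My guess is that you should always assign only a known-valid class value (0 or 1) to each coordinate, and that averaging class values is not a valid approach.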
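To make the soft/hard distinction concrete, here is a minimal sketch, not a definitive implementation. The function names `nearest_centroid` and `predict_class` are made up for illustration, and two choices are assumptions of the sketch rather than anything stated above: an empty cluster falls back to 0.5, and the hard label is the majority class with ties going to 1 (for 0/1 data this matches the median except in an exact tie, where the median would be 0.5).

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid for each row of points."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # (n_points, k) matrix of squared Euclidean distances via broadcasting
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def predict_class(new_points, centroids, train_points, train_classes):
    """Return (soft, hard) class estimates for each new point."""
    train_classes = np.asarray(train_classes, dtype=float)
    train_assign = nearest_centroid(train_points, centroids)
    k = len(centroids)
    soft = np.zeros(k)
    for c in range(k):
        members = train_classes[train_assign == c]
        # Assumption: an empty cluster carries no information, so use 0.5
        soft[c] = members.mean() if members.size else 0.5
    # Assumption: majority vote, ties resolved in favour of class 1
    hard = (soft >= 0.5).astype(int)
    new_assign = nearest_centroid(new_points, centroids)
    return soft[new_assign], hard[new_assign]

centroids = [[0, 0], [10, 10]]
train_points = [[0, 1], [1, 0], [1, 1], [9, 9], [10, 11], [11, 10]]
train_classes = [0, 0, 1, 1, 1, 1]
soft, hard = predict_class([[2, 2], [8, 9]], centroids, train_points, train_classes)
```

The soft value is the fraction of 1s in the nearest cluster and can be read as a rough confidence; the hard value is always a legal class label.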

score 0 · Accepted Answer

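This is how I assign labels to the nearest existing centroid. It is also useful for implementing online/incremental clustering, where you make new assignments to existing clusters while keeping the centroids fixed. Be careful, though: after (say) 5-10% new points, you may want to recompute the centroid coordinates.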

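import numpy as np
import pandas as pd

def Labs(dataset, centroids):
    """Assign each row of dataset to its nearest centroid (labels start at 1)."""
    dataset, centroids = np.asarray(dataset), np.asarray(centroids)
    a = []
    for i in range(len(dataset)):
        d = []
        for j in range(len(centroids)):  # loop over every centroid
            dist = np.linalg.norm(dataset[i, :] - centroids[j, :])
            d.append(dist)
        assignment = np.argmin(d)
        a.append(assignment)
    return pd.DataFrame(np.array(a) + 1, columns=['Lab'])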

I hope it helps.
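For the recomputation step mentioned above, a rough sketch could look like the following. The helper name `recompute_centroids` is hypothetical, it expects 0-based cluster indices (so subtract 1 from the `Lab` column first), and leaving the centroid of an empty cluster unchanged is an assumption of this sketch.

```python
import numpy as np

def recompute_centroids(points, assignments, old_centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    points = np.asarray(points, dtype=float)
    assignments = np.asarray(assignments)
    new_centroids = np.array(old_centroids, dtype=float)  # works on a copy
    for c in range(len(new_centroids)):
        members = points[assignments == c]
        # Assumption: a cluster with no points keeps its old centroid
        if members.size:
            new_centroids[c] = members.mean(axis=0)
    return new_centroids

points = [[0, 0], [2, 2], [10, 10]]
assignments = [0, 0, 1]  # 0-based cluster index per point
old_centroids = [[5, 5], [6, 6], [7, 7]]
updated = recompute_centroids(points, assignments, old_centroids)
```

Here `points` would be the original data plus the newly assigned points, so the update is one ordinary k-means centroid step.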
