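I am writing the K-Means algorithm in Python with NumPy. The part that computes the distances to all centroids is well optimized (it uses a centroid matrix instead of handling each centroid separately), but I am struggling with the part that computes the new centroids. Right now I copy each centroid's rows out of the dataset in order to compute the mean.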

I think it would be faster without the copy. How can I do that in Python/NumPy?

Code snippet:

    for c_i in range(k):
        sub_data = np.zeros([n_per_c[c_i],data_width])

        sub_data_i = 0
        for data_i in range(data_length):
            if label[data_i] == c_i:                    
                sub_data[sub_data_i,:] = data[data_i,:]
                sub_data_i += 1

        c[c_i] = np.mean(sub_data, axis=0)

c is my list of centroids, data is the whole dataset, and label is the list of class labels.

1 Answer

I think the following does the same as your code, without any explicit intermediate array:

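    # label must be a NumPy array here (np.asarray(label) if it is a list);
    # comparing a plain list to an int does not give an element-wise mask
    for c_i in range(k):
        c[c_i] = np.mean(data[label == c_i, :], axis=0)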
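
To make the masked version concrete, here is a small self-contained sketch. The toy `data` and `label` values are made up for illustration and are not from the question. Note that boolean indexing still builds a temporary copy of the selected rows, but it does so in compiled code rather than in a Python loop, which is usually where the time goes.

```python
import numpy as np

# Toy inputs, invented for this example only.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0],
                 [7.0, 8.0]])
label = np.asarray([0, 1, 0, 1])  # convert in case label is a plain list
k = 2

c = np.empty((k, data.shape[1]))
for c_i in range(k):
    mask = label == c_i               # boolean array, one entry per row of data
    c[c_i] = data[mask].mean(axis=0)  # mean of the rows assigned to cluster c_i

print(c)
```

If a cluster ends up with no points, the mean of an empty selection is NaN, so that case probably needs separate handling.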

Getting rid of that last loop is tougher, but this should work:

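    # label must be an integer NumPy array; data has shape (data_length, data_width)
    label_counts = np.bincount(label, minlength=k)
    label_sums = np.histogram2d(np.repeat(label, data_width),
                                np.tile(np.arange(data_width), data_length),
                                bins=(k, data_width),
                                range=[[0, k], [0, data_width]],
                                weights=data.ravel())[0]
    c = label_sums / label_counts[:, None]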
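
As an alternative way to drop the loop (not what I used above, just another option that may be easier to read), the per-cluster sums can be accumulated with `np.add.at`. This is a minimal sketch; the function name `centroids_add_at` is hypothetical, and it assumes `label` holds integers in `0..k-1` and `data` has shape `(data_length, data_width)`.

```python
import numpy as np

def centroids_add_at(data, label, k):
    """Return a (k, data_width) array of per-cluster means."""
    data = np.asarray(data, dtype=float)
    label = np.asarray(label)
    sums = np.zeros((k, data.shape[1]))
    # Unbuffered in-place add: sums[label[i]] += data[i] for every row i,
    # so rows that share a label are all accumulated.
    np.add.at(sums, label, data)
    counts = np.bincount(label, minlength=k)
    # An empty cluster has count 0 and yields NaN (with a runtime warning).
    return sums / counts[:, None]

# Usage with toy values invented for this example.
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
label = np.array([0, 1, 0, 1])
c = centroids_add_at(data, label, 2)
```

It is worth timing both variants on your real data, since which one is faster can depend on `k` and on the array sizes.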
answered 2013-07-23T12:37:05.150