See @agartland's answer: you can specify n_jobs in sklearn.metrics.pairwise.pairwise_distances, or look in sklearn.cluster for clustering algorithms that take an n_jobs parameter, e.g. sklearn.cluster.KMeans.
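A minimal sketch of the first option, assuming a NumPy array of samples (the data array here is made-up sample input; metric and n_jobs are real pairwise_distances parameters):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical data: 100 samples with 10 features each.
data = np.random.rand(100, 10)

# n_jobs=-1 uses all available cores; metric can be a string
# name ("manhattan", "euclidean", ...) or a callable.
D = pairwise_distances(data, metric="manhattan", n_jobs=-1)

print(D.shape)  # square (n_samples, n_samples) distance matrix
```

Note this returns the full square matrix, not the condensed 1D form that scipy.cluster.hierarchy.linkage expects.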
Still, if you feel adventurous, you can implement the computation yourself. For example, if you need the 1D (condensed) distance matrix for scipy.cluster.hierarchy.linkage, you can use:
#!/usr/bin/env python3
from multiprocessing import Pool
from time import time as ts

import numpy as np

data = np.zeros((100, 10))  # YOUR data: np.array[n_samples x m_features]
n_processes = 4             # YOUR number of processors

def metric(a, b):           # YOUR dist function
    return np.sum(np.abs(a - b))

n = data.shape[0]
k_max = n * (n - 1) // 2            # number of elements in the 1D dist array
k_step = max(1, n ** 2 // 500)      # ~500 chunks (guard against 0 for small n)
dist = np.zeros(k_max)              # resulting 1D dist array

def proc(start):
    dist = []
    k1 = start
    k2 = min(start + k_step, k_max)
    for k in range(k1, k2):
        # get (i, j) for the 2D distance matrix from index (k) of the 1D one
        i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
        j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
        # compute and store the distance
        a = data[i, :]
        b = data[j, :]
        d = metric(a, b)
        dist.append(d)
    return k1, k2, dist

if __name__ == "__main__":  # guard required for multiprocessing on Windows/macOS
    ts_start = ts()
    with Pool(n_processes) as pool:
        for k1, k2, res in pool.imap_unordered(proc, range(0, k_max, k_step)):
            dist[k1:k2] = res
            print("{:.0f} minutes, {:,}..{:,} out of {:,}".format(
                (ts() - ts_start) / 60, k1, k2, k_max))
    print("Elapsed %.0f minutes" % ((ts() - ts_start) / 60))
    print("Saving...")
    np.savez("dist.npz", dist=dist)
    print("DONE")
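The closed-form k → (i, j) conversion used in proc() can be checked sequentially against scipy.spatial.distance.pdist, which fills the same condensed 1D layout (a small self-contained check with made-up data; the metric matches the Manhattan distance used above):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 20
data = rng.random((n, 10))
k_max = n * (n - 1) // 2

dist = np.zeros(k_max)
for k in range(k_max):
    # same closed-form k -> (i, j) mapping as in proc() above
    i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
    j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
    dist[k] = np.sum(np.abs(data[i] - data[j]))

# pdist with the "cityblock" metric produces the identical condensed array
assert np.allclose(dist, pdist(data, metric="cityblock"))
print("index mapping matches pdist")
```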
As you may know, the scipy.cluster.hierarchy.linkage implementation is not parallel, and its complexity is at least O(N*N). I am not sure whether scipy has a parallel implementation of this functionality.
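For completeness, the condensed 1D array produced above is exactly the input format linkage expects. A minimal sketch (it rebuilds a small condensed array in-process instead of loading dist.npz, so it is self-contained; the cluster count 3 is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
n = 50
data = rng.random((n, 10))

# Condensed Manhattan distances, same layout the parallel script produces.
dist = np.array([np.sum(np.abs(data[i] - data[j]))
                 for i in range(n) for j in range(i + 1, n)])

Z = linkage(dist, method="average")               # shape (n - 1, 4)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into <= 3 clusters
print(Z.shape, np.unique(labels))
```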