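I have a data set: a distance matrix for 1000 homologous protein sequences.
I have managed to compute an affinity matrix from it (a simple calculation in my case: 1 - distance).
Essentially, if you view the data in Excel, there is no header row; the first column holds the sequence names, and the next 1000 columns hold the distance values.
I have adapted the code provided on sklearn's Affinity Propagation page. This is what it looks like now: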
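To make the layout concrete, here is a minimal sketch that parses a tiny stand-in for the file and applies the 1 - distance conversion. The three sequence names, the distance values, and the helper `parse_distance_rows` are hypothetical, made up for illustration; the real file has 1000 rows and 1000 distance columns.

```python
import csv
import io

# Hypothetical three-sequence stand-in for the real 1000 x 1000 file:
# no header row, name in column 0, distances in the remaining columns.
sample_csv = io.StringIO(
    "seqA,0.0,0.2,0.9\n"
    "seqB,0.2,0.0,0.8\n"
    "seqC,0.9,0.8,0.0\n"
)


def parse_distance_rows(handle):
    """Split each row into a sequence name and a list of float distances."""
    names, distances = [], []
    for row in csv.reader(handle):
        names.append(row[0])
        distances.append([float(value) for value in row[1:]])
    return names, distances


names, distances = parse_distance_rows(sample_csv)

# Affinity as used in this question: 1 - distance, element by element.
affinities = [[1.0 - d for d in row] for row in distances]

print(names)
print(affinities[0])
```

Row i of the parsed matrix belongs to `names[i]`, so the row order is the only link between a sequence name and its distances.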
print(__doc__)
import csv
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
##############################################################################
# Load the distance matrix: no header row, sequence name in the first column,
# distances in the remaining columns.
sequence_names = []
distance_matrix = []
full_data = []
with open('ha-sequences-sample-distmat2.csv') as f:
    for row in csv.reader(f):
        sequence_names.append(row[0])
        distance_matrix.append(row[1:])
        full_data.append(row)
distmat = np.array(distance_matrix, dtype=float)
# Affinity is taken to be 1 - distance.
affinity_matrix = 1 - distmat
full_matrix = list(zip(sequence_names, affinity_matrix))
##############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(affinity='precomputed').fit(affinity_matrix)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
# These scores normally compare against known class labels; here the
# sequence names are passed in that position.
print("Homogeneity: %0.3f" % metrics.homogeneity_score(sequence_names, labels))
print("Completeness: %0.3f" % metrics.completeness_score(sequence_names, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(sequence_names, labels))
print("Adjusted Rand Index: %0.3f" %
      metrics.adjusted_rand_score(sequence_names, labels))
print("Adjusted Mutual Information: %0.3f" %
      metrics.adjusted_mutual_info_score(sequence_names, labels))
# Silhouette is computed on the distances themselves, hence 'precomputed'.
print("Silhouette Coefficient: %0.3f" %
      metrics.silhouette_score(distmat, labels, metric='precomputed'))
##############################################################################
# Plot result
import pylab as pl
from itertools import cycle
pl.close('all')
pl.figure(1)
pl.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
# Columns 0 and 1 of the affinity matrix are used as the x and y coordinates.
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = affinity_matrix[cluster_centers_indices[k]]
    pl.plot(affinity_matrix[class_members, 0], affinity_matrix[class_members, 1], col + '.')
    pl.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=14)
    for x in affinity_matrix[class_members]:
        pl.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()
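The problem I am running into: I don't know how to output the sequence names that correspond to each cluster. Ideally I would print the sequences that cluster together to the shell and show the cluster numbers on the plot, but even without anything shown on the plot, that would be fine.
Does anyone know how to do this?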
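For concreteness, here is a rough sketch of the shell output I am after, untested on the real data. It assumes that `labels` comes back in the same row order as `sequence_names`, and that label `k` refers to the exemplar at `cluster_centers_indices[k]`. The helper `group_names_by_cluster` and the toy names, labels, and indices below are hypothetical stand-ins, not results from my file.

```python
from collections import defaultdict


def group_names_by_cluster(names, labels):
    """Map each cluster label to the names assigned to it, in input order."""
    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        clusters[int(label)].append(name)
    return dict(clusters)


# Hypothetical stand-ins for sequence_names, af.labels_ and
# af.cluster_centers_indices_ from the script above.
sequence_names = ["seqA", "seqB", "seqC", "seqD", "seqE"]
labels = [0, 0, 1, 0, 1]
cluster_centers_indices = [1, 4]  # row index of each cluster's exemplar

clusters = group_names_by_cluster(sequence_names, labels)
for label in sorted(clusters):
    exemplar = sequence_names[cluster_centers_indices[label]]
    members = ", ".join(clusters[label])
    print("Cluster %d (exemplar %s): %s" % (label, exemplar, members))
```

For the plot, something like `pl.annotate(str(k), xy=(cluster_center[0], cluster_center[1]))` inside the plotting loop might put the cluster number next to each center, though I have not tried it.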