python - Choosing n clusters for chemical fingerprints

Question

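Hello, I am trying to cluster chemical fingerprints.

I am using RDKit, which provides a hierarchical clustering method. The problem is that I already know how many clusters I want (13), so I am instead using scikit's k-means on the Tanimoto similarity scores.

Here is my code: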

import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from sklearn import cluster

# `cs` (the compound search client) and `mol` (a DataFrame) are defined earlier (not shown)
fps = []

# mol["idx"] contains the names of the molecules
for x in mol["idx"]:
    res = cs.search(x)
    # get the SMILES string of the molecule
    smi = res[0].smiles

    # build the molecule and compute its fingerprint
    m = Chem.MolFromSmiles(str(smi))
    fp = FingerprintMols.FingerprintMol(m)
    fps.append(fp)


# compute the similarity scores (an N x N matrix where each entry is the Tanimoto score of a pair of molecules)

dists = []
nfps = len(fps)
for i in range(nfps):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)
    dists.append(sims)

# store the values in a DataFrame and apply k-means
mol_dist = pd.DataFrame(dists)

k_means = cluster.KMeans(n_clusters=13)
k1 = k_means.fit_predict(mol_dist)
mol["cluster"] = k1

# get the result
final = mol[["idx", "cluster"]]

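The clustering seems to work in some way, but I am not sure how chemical fingerprints should be clustered. Should the clustering algorithm be applied directly to the fingerprints themselves?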

score 0 · Accepted Answer

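I think the issue in clustering is how to choose an appropriate k. Your problem might be solved as follows:

Determine the appropriate number of clusters k. You can use methods such as the elbow method; see the link below - https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering
Once you have k, choose appropriate features and, with that k, cluster your data set and evaluate the result.

Best regards!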
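
As a rough illustration of the two steps above, here is a minimal sketch that scans several values of k, records the k-means inertia (the quantity plotted in an elbow curve) and, as one possible way to evaluate the result, the silhouette score. The helper name `scan_k` and the synthetic `features` matrix are assumptions made for the example; in your case `features` would be your own matrix (for instance `mol_dist`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_k(features, k_values, random_state=0):
    """Fit KMeans for each k and return (k, inertia, silhouette) tuples."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        results.append((k, model.inertia_, silhouette_score(features, labels)))
    return results


# Synthetic stand-in for a feature matrix (assumption, not real fingerprint data):
# three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

results = scan_k(features, range(2, 7))
for k, inertia, sil in results:
    # Elbow: look for the k after which inertia stops dropping sharply.
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# Alternative criterion: the k with the highest silhouette score.
best_k = max(results, key=lambda r: r[2])[0]
print("k with highest silhouette:", best_k)
```

If you already know you want 13 clusters, the same scores can still be used to check how well k=13 fits compared with neighbouring values.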
