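Hello, I am trying to cluster chemical fingerprints.
I am using rdkit, which provides a hierarchical method for clustering. The problem is that I already know how many clusters I want (13), so I am instead using the k-means method from scikit-learn on the Tanimoto similarity scores.
Here is my code: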
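To be explicit about the score: for two bit fingerprints, the Tanimoto similarity is the number of bits set in both divided by the number of bits set in either. Below is a minimal pure-Python sketch of that definition, independent of rdkit. The function name `tanimoto`, the toy bit sets, and the handling of two empty fingerprints are assumptions made for illustration only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Arbitrary convention for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Toy fingerprints, written as the indices of the bits that are set (made-up data).
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

similarity = tanimoto(fp_a, fp_b)  # 2 shared bits / 5 bits in the union
distance = 1.0 - similarity        # a common way to turn the score into a distance
```

Note that the score is a similarity (1 means identical), so it has to be converted, for example as `1 - similarity`, before being passed to anything that expects distances.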
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd
from sklearn import cluster

# `mol` and `cs` are defined earlier (not shown)
smiles = []
molFin = []
fps = []
np_fps = []
# mol["idx"] contains the names of the molecules
for x in mol["idx"]:
    res = cs.search(x)
    # get the SMILES string of the molecule
    smi = res[0].smiles
    # build the molecule, then compute its fingerprint
    m = Chem.MolFromSmiles(str(smi))
    fp = FingerprintMols.FingerprintMol(m)
    fps.append(fp)
# compute the similarity scores (a molecule-by-molecule matrix where each entry is the Tanimoto score of a pair)
dists = []
nfps = len(fps)
for i in range(0, nfps):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)
    dists.append(sims)
# store the values in a data frame and apply k-means
mol_dist = pd.DataFrame(dists)
k_means = cluster.KMeans(n_clusters=13)
k1 = k_means.fit_predict(mol_dist)
mol["cluster"] = k1
# get the result
final = mol[["idx", "cluster"]]
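The clustering seems to work in some way, but I do not know how chemical fingerprints are supposed to be clustered. Should the clustering algorithm be applied directly to the fingerprints themselves?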
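For context on the question: as far as I understand, k-means treats each row of the similarity matrix as a feature vector and compares rows by Euclidean distance, so the Tanimoto scores are only used indirectly. A hierarchical method does not actually require leaving the number of clusters open, since merging can stop once the wanted number of groups remains. Here is a minimal pure-Python sketch of that idea, assuming average linkage on a `1 - Tanimoto` distance. The helper names `tanimoto` and `average_linkage` and the toy fingerprints are hypothetical, and this is not what rdkit does internally.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def average_linkage(dist, n_clusters):
    """Merge the two closest groups until n_clusters remain.

    dist is a full symmetric distance matrix (list of lists).
    Returns one integer cluster label per item.
    """
    groups = [[i] for i in range(len(dist))]

    def group_distance(g, h):
        # Average pairwise distance between the members of two groups.
        return sum(dist[i][j] for i in g for j in h) / (len(g) * len(h))

    while len(groups) > n_clusters:
        # Pick the pair of groups with the smallest average distance.
        _, a, b = min(
            (group_distance(groups[a], groups[b]), a, b)
            for a in range(len(groups))
            for b in range(a + 1, len(groups))
        )
        groups[a].extend(groups[b])
        del groups[b]

    labels = [0] * len(dist)
    for label, members in enumerate(groups):
        for i in members:
            labels[i] = label
    return labels


# Toy fingerprints as sets of on-bit indices (made-up data with two obvious groups).
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {10, 11, 12}, {10, 11, 13}]
dist = [[1.0 - tanimoto(p, q) for q in toy_fps] for p in toy_fps]

labels = average_linkage(dist, n_clusters=2)
```

This naive loop is far too slow for large sets and is only meant to show the principle. For real data, a library routine that accepts a precomputed distance matrix and a target cluster count (for example `scipy.cluster.hierarchy.fcluster` with `criterion='maxclust'`) would be the more practical choice.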