
Hi,

I have a dataset coming from different groups, and I would like to bicluster it with scikit-learn's SpectralBiclustering. As you can see in the link above, this method already applies a normalization internally before computing the SVD.

Is it necessary to normalize the data before biclustering, e.g. with StandardScaler (zero mean, unit variance)? The function above already applies its own normalization; is that sufficient, or do I still have to normalize beforehand, for example when the data come from different distributions?
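For intuition about what the internal step does: the `'bistochastic'` method rescales rows and columns alternately until all row sums are equal and all column sums are equal (a Sinkhorn-style iteration). Below is a rough sketch of that idea, not scikit-learn's actual implementation; note that the classic iteration assumes positive entries, which z-scored data generally violates:

```python
import numpy as np

def bistochastize(X, n_iter=1000, tol=1e-10):
    """Sinkhorn-style alternating scaling: a sketch of the idea behind the
    'bistochastic' normalization, NOT scikit-learn's exact code.
    Assumes strictly positive entries."""
    X = np.asarray(X, dtype=float).copy()
    for _ in range(n_iter):
        X /= X.sum(axis=1, keepdims=True)  # equalize row sums
        X /= X.sum(axis=0, keepdims=True)  # then equalize column sums
        # converged once the row sums are (nearly) constant as well
        if np.ptp(X.sum(axis=1)) < tol:
            break
    return X

rng = np.random.default_rng(0)
B = bistochastize(rng.random((5, 4)) + 0.1)
# at convergence: column sums are all 1, row sums are all equal (n_cols / n_rows)
print(B.sum(axis=0).round(6), B.sum(axis=1).round(6))
```

This is only meant to show that the built-in normalization equalizes row/column scales, which is a different operation from StandardScaler's per-column centering and scaling, so the two are not interchangeable.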

I get different results with and without StandardScaler, and I could not find any guidance in the original paper on whether pre-scaling is necessary.

Below you can find the code and an example of my dataset. This is real data, so I don't know the ground truth. At the end I compute the consensus score to compare the two biclusterings; unfortunately, the clusters are not identical.

I also tried this with artificial data (see the example in the last link), where both variants give the same result, unlike with the real data.
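The artificial-data check can be reproduced with `sklearn.datasets.make_checkerboard`; a minimal sketch (shape, noise level, and cluster counts chosen arbitrarily here, not necessarily the linked example):

```python
from sklearn.datasets import make_checkerboard
from sklearn.cluster import SpectralBiclustering
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

# artificial checkerboard data with a known (4, 4) block structure
X, rows, cols = make_checkerboard(shape=(120, 80), n_clusters=(4, 4),
                                  noise=5, shuffle=True, random_state=0)

# fit once on the raw data and once on the standardized data
models = []
for data in (X, StandardScaler().fit_transform(X)):
    m = SpectralBiclustering(n_clusters=(4, 4), method='bistochastic',
                             random_state=0)
    m.fit(data)
    models.append(m)

# 1.0 would mean the two biclusterings are identical
score = consensus_score(models[0].biclusters_, models[1].biclusters_)
print(score)
```

With synthetic data the ground-truth masks (`rows`, `cols`) are also available, so each model can additionally be scored against the truth rather than only against each other.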

So how do I know which approach is correct?

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.cluster import SpectralBiclustering  # sklearn.cluster.bicluster was removed in newer scikit-learn versions
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

n_clusters = (4, 4)

data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0) 


# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)


# plot original clusters
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Original dataset")
plt.show()


data_type = ['unscaled', 'scaled']
data_all = [data_org, data_scaled]

models_all = []

for name, data in zip(data_type,data_all):

    # fit spectral biclustering on the unscaled / scaled data
    # (n_jobs was removed from SpectralBiclustering in newer scikit-learn versions)
    model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic',
                                 svd_method='randomized', random_state=0)
    model.fit(data)


    # order rows by row-cluster label, ties broken by index
    newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
    newOrder_row.sort(key=lambda k: (k[0], k[1]))
    order_row = [i[1] for i in newOrder_row]

    # order columns by column-cluster label (column names are numeric strings)
    newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
    newOrder_col.sort(key=lambda k: (k[0], k[1]))
    order_col = [i[1] for i in newOrder_col]

    # reorder the data matrix
    X_plot = data_scaled.copy()
    X_plot = X_plot.reindex(order_row) # rows
    X_plot = X_plot[[str(x) for x in order_col]] # columns

    # use clustermap purely for plotting, with clustering disabled
    cm = sns.clustermap(X_plot, cmap='viridis',
                        row_cluster=False, col_cluster=False,
                        yticklabels=1, xticklabels=1,
                        standard_scale=None, z_score=None, robust=False,
                        vmin=-3, vmax=5)

    ax = cm.ax_heatmap

    # shrink the tick labels on the heatmap axes
    ax.tick_params(labelsize=5.5)


    # cluster boundaries: biclusters_[0] / biclusters_[1] are boolean masks of
    # shape (n_row_clusters * n_col_clusters, n_rows / n_cols); summing each mask
    # gives the bicluster sizes, and the strided / sliced cumsum recovers the
    # row- and column-cluster boundary positions
    hor_lines = [sum(item) for item in model.biclusters_[0]]
    hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))

    ver_lines = [sum(item) for item in model.biclusters_[1]]
    ver_lines = list(np.cumsum(ver_lines[:n_clusters[0]]))

    for pp in range(len(hor_lines) - 1):
        cm.ax_heatmap.hlines(hor_lines[pp], 0, X_plot.shape[1], colors='r')

    for pp in range(len(ver_lines) - 1):
        cm.ax_heatmap.vlines(ver_lines[pp], 0, X_plot.shape[0], colors='r')

    # title
    title = f"{name} - {n_clusters[1]}-{n_clusters[0]}"
    plt.title(title)
    cm.savefig(title,dpi=300)
    plt.show() 

    # save models
    models_all.append(model)

# compare the two models
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between unscaled and scaled biclustering: {:.1f}".format(score))