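I am implementing an algorithm that calculates a bias for each cluster and then splits the cluster with the highest bias into new clusters. Ultimately, I want to find the clusters with the highest bias, meaning the clusters on which the classifier makes either more or fewer errors than on the rest of the data.
This is the algorithm: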
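1. Start with the whole dataset as one cluster
2. Split it into two clusters with KMeans
3. Calculate the macro F1 score of each cluster
4. Calculate the bias of these two clusters. The bias is: F1-score_cluster_k - F1-score of all clusters except cluster k
5. if max(bias_cluster_i, bias_cluster_j) >= bias_previous_cluster: add cluster_i and cluster_j to the list and remove the previous cluster
6. Continue with the cluster from cluster_list that has the highest standard deviation of the error metric.
7. Split this cluster into 2 clusters with KMeans and continue from step 3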
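To make the bias definition above concrete, here is a minimal pure-Python sketch on toy data. The helper names `macro_f1` and `cluster_bias` and the toy labels are made up for illustration and are not part of my code below. It assumes the macro F1 is averaged over the classes that actually occur in the true labels.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(classes)


def cluster_bias(y_true, y_pred, clusters, k):
    """Macro F1 of cluster k minus macro F1 of all instances outside cluster k."""
    inside = [(t, p) for t, p, c in zip(y_true, y_pred, clusters) if c == k]
    outside = [(t, p) for t, p, c in zip(y_true, y_pred, clusters) if c != k]
    f1_in = macro_f1([t for t, _ in inside], [p for _, p in inside])
    f1_out = macro_f1([t for t, _ in outside], [p for _, p in outside])
    return f1_in - f1_out


# Toy data: cluster 0 is classified perfectly, cluster 1 contains one error
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1]
clusters = [0, 0, 0, 1, 1, 1]
bias_0 = cluster_bias(y_true, y_pred, clusters, 0)
bias_1 = cluster_bias(y_true, y_pred, clusters, 1)
```

With only two clusters the two biases are each other's negation, so only their sign and magnitude carry information at the first split.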
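For this algorithm to work, I need to save the cluster assignments and F-scores of the previous iteration so that I can compare them in the current iteration (step 5).
- One solution I have is to save the cluster assignments as a new column in the Pandas DataFrame and then compare that column with the new cluster assignments, but is there a better way to prevent these cluster assignments from being overwritten?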
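A rough sketch of the kind of bookkeeping I have in mind, in plain Python so it stays self-contained. Everything here is hypothetical: the helper `split_cluster`, the `history` and `scores` containers, the 0/1 sub-labels (which would come from KMeans in practice) and the bias numbers are all made up for illustration. The idea is that every split hands out fresh cluster ids and appends a new snapshot instead of overwriting the old one.

```python
def split_cluster(assignment, parent, sub_labels, next_id):
    """Return a new assignment list in which members of `parent` get fresh ids.

    `sub_labels` holds one 0/1 label per member of `parent`, in order
    (in practice this would be the output of a two-cluster KMeans fit).
    """
    new_assignment = list(assignment)  # copy, so the previous snapshot is untouched
    members = [i for i, c in enumerate(assignment) if c == parent]
    for idx, sub in zip(members, sub_labels):
        new_assignment[idx] = next_id + sub
    return new_assignment


# Iteration 0: every instance is in cluster 0
history = [[0, 0, 0, 0, 0, 0]]
scores = {0: {"parent": None, "bias": 0.0}}  # placeholder bias for the root

# Iteration 1: split cluster 0 into clusters 1 and 2
history.append(split_cluster(history[-1], parent=0,
                             sub_labels=[0, 0, 1, 1, 1, 0], next_id=1))
scores[1] = {"parent": 0, "bias": 0.25}    # made-up numbers
scores[2] = {"parent": 0, "bias": -0.25}   # made-up numbers

# Iteration 2: split cluster 2 into clusters 3 and 4
history.append(split_cluster(history[-1], parent=2,
                             sub_labels=[1, 0, 0], next_id=3))

# The bias of the cluster that was split stays available by id
parent_bias = scores[scores[1]["parent"]]["bias"]
```

Each entry of `history` could just as well be stored as its own DataFrame column (one column per iteration), and the comparison in step 5 becomes a dictionary lookup on the parent id.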
This is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
data = load_wine()
df_data = pd.DataFrame(data.data, columns=data.feature_names)
df_target = pd.DataFrame(data = data.target)
# Merging the datasets into one dataframe
all_data = df_data.merge(df_target, left_index=True, right_index=True)
all_data.rename( columns={0 :'target_class'}, inplace=True )
all_data.head()
# Dividing X and y into train and test data (small train data to gain more errors)
X_train, X_test, y_train, y_test = train_test_split(df_data, df_target, test_size=0.60, random_state=2)
# Training a RandomForest Classifier
model = RandomForestClassifier()
model.fit(X_train, y_train.values.ravel())
# Obtaining predictions
y_hat = model.predict(X_test)
# Collecting predictions and true labels in a DataFrame that shares X_test's index,
# so the merge with X_test below lines up row by row
predictions_col = pd.DataFrame(index=X_test.index)
predictions_col['predicted_class'] = y_hat
predictions_col['true_class'] = y_test.values.ravel()
# Marking the errors: 1 for a misclassification, 0 for a correct prediction.
# It doesn't matter whether the misclassification is between class 0 and 2 or between 0 and 1, it has the same error value.
predictions_col['errors'] = (predictions_col['predicted_class'] != predictions_col['true_class']).astype(int)
# Adding predictions to test data
df_out = pd.merge(X_test, predictions_col, left_index = True, right_index = True)
# Scaling the feature columns of the test data
df_matrix = df_out[data.feature_names]
scaled_matrix = StandardScaler().fit_transform(df_matrix)
# Calculating the F-score of one class from the true and predicted labels of the instances.
def F_score(results, class_number):
    is_true = results["true_class"] == class_number
    is_pred = results["predicted_class"] == class_number
    true_pos = results.loc[is_true & is_pred]
    false_pos = results.loc[~is_true & is_pred]
    false_neg = results.loc[is_true & ~is_pred]
    try:
        precision = len(true_pos)/(len(true_pos) + len(false_pos))
    except ZeroDivisionError:
        return 0
    try:
        recall = len(true_pos)/(len(true_pos) + len(false_neg))
    except ZeroDivisionError:
        return 0
    # Without true positives both precision and recall are 0
    if precision + recall == 0:
        return 0
    f_score = 2 * ((precision * recall)/(precision + recall))
    return f_score
# Calculating the macro average F-score
def mean_f_score(results):
    classes = results['true_class'].unique()
    # An empty cluster has no classes to average over
    if len(classes) == 0:
        return 0
    class_list = []
    for i in classes:
        class_i = F_score(results, i)
        class_list.append(class_i)
    mean_f_score = sum(class_list)/len(classes)
    return mean_f_score
def calculate_bias(clustered_data, cluster_number):
    cluster_x = clustered_data.loc[clustered_data["assigned_cluster"] == cluster_number]
    remaining_clusters = clustered_data.loc[clustered_data["assigned_cluster"] != cluster_number]
    # Negative bias: F-score of the remaining clusters minus F-score of this cluster
    return mean_f_score(remaining_clusters) - mean_f_score(cluster_x)
MAX_ITER = 10
cluster_comparison = []
# start with all instances in one cluster
# scaled_matrix
for i in range(1, MAX_ITER):
    kmeans_algo = KMeans(n_clusters=2, **clus_model_kwargs).fit(scaled_matrix)
    # Adding the assigned cluster as a column next to the true and predicted classes
    clustered_data = df_out.copy()
    clustered_data['assigned_cluster'] = kmeans_algo.predict(scaled_matrix)
    # Calculating bias per cluster
    negative_bias_0 = calculate_bias(clustered_data, 0)
    negative_bias_1 = calculate_bias(clustered_data, 1)
    # the code below doesn't work
    if max(negative_bias_0, negative_bias_1) >= bias_prev_iteration: