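This doesn't add much to the conversation, but after struggling with this problem for longer than was warranted (the actual clustering turned out to be unusable), I thought I would add my implementation as another example. It has an overlaid scatterplot (because of how annoying my dataset is), shows melt using indices, and includes a few aesthetic tweaks. I hope it is useful to someone.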
[Image: output graph]
Here it is without column headers (I saw a different thread wondering how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis(..., inplace=True) was removed in pandas 2.0, so reassign the result instead
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
Resulting dataframe (rows = sample_n x variable_n; in my case 1626 x 6 = 9756):
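| index | cluster | psychometric_test | standard deviations from the mean |
|---|---|---|---|
| 0 | 0.0 | outcome_var_1 | -1.276182 |
| 1 | 0.0 | outcome_var_1 | -1.118813 |
| 2 | 0.0 | outcome_var_1 | -1.276182 |
| ... | ... | ... | ... |
| 9754 | 0.0 | outcome_var_6 | 0.892548 |
| 9755 | 0.0 | outcome_var_6 | 1.420480 |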
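To make the row count concrete, here is a minimal sketch of the same wide-to-long reshape on a tiny made-up dataset. The array values, the three `outcome_var_*` names and the `labels` array are stand-ins I invented for illustration, not the real DBSCAN output. It also shows why the cluster column prints as `0.0`: concatenating integer labels onto float data upcasts them to float.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the clustering output: 4 samples, 3 outcome variables
data = np.array([[-1.2, 0.3, 0.9],
                 [-1.1, 0.1, 1.4],
                 [0.5, -0.7, 0.2],
                 [2.0, 1.8, -0.4]])
labels = np.array([0, 0, 1, -1])  # -1 is the DBSCAN noise label

# reshape(-1, 1) turns the 1D label array into a single column so it can be appended
combined = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
wide = pd.DataFrame(combined, columns=["outcome_var_1", "outcome_var_2", "outcome_var_3", "cluster"])

# One output row per (sample, outcome variable) pair
long = pd.melt(
    wide,
    id_vars=["cluster"],
    var_name="psychometric_test",
    value_name="standard deviations from the mean",
)

print(wide.shape)  # (4, 4)
print(long.shape)  # (12, 3): 4 samples x 3 variables
```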
If you want to use indices in the melt:
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
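As a quick sanity check, the sketch below (with a made-up two-row array and a shortened `value_name`, both my own assumptions) suggests that selecting columns by position through `df.columns` gives the same result as naming the default integer labels directly when the frame has no headers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two outcome columns followed by the cluster label column
combined = np.array([[-1.2, 0.3, 0.0],
                     [0.5, -0.7, 1.0]])
df = pd.DataFrame(combined)  # no headers, so the column labels are the integers 0, 1, 2

# Positional selection: last column is the id, everything before it is a value column
by_position = pd.melt(
    df,
    id_vars=df.columns[-1],
    value_vars=df.columns[:-1],
    var_name="psychometric_test",
    value_name="value",
)

# The same call with the integer labels spelled out
by_label = pd.melt(
    df,
    id_vars=[2],
    value_vars=[0, 1],
    var_name="psychometric_test",
    value_name="value",
)

print(by_position.equals(by_label))
```

The positional form is handy when the number of outcome variables changes between runs, since nothing has to be renamed.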
Here is the graph code (done using column headers; note that y-axis = value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
from typing import Dict, List
import matplotlib.pyplot as plt
import seaborn as sns

# theme and style are seaborn settings (not Figure methods); set them before plotting
sns.set_theme(style="ticks", font_scale=1.2)
sns.set_style("white")
fig = plt.figure(figsize=(10, 10))
# create boxplot
ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
                 data=graph_data)
# set box alpha (newer matplotlib releases keep the boxes in ax.patches, older ones in ax.artists):
for patch in list(ax.patches) + list(ax.artists):
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot on the same axes
ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
                   dodge=True, alpha=.25, zorder=1, ax=ax)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
legend_labels: List[str] = [f"cluster {i + 1}, n = {cluster_info[i]}" for i in range(cluster_n)]
if -1 in cluster_info:
    # DBSCAN marks unclustered (noise) samples with the label -1
    cluster_n += 1
    legend_labels.insert(0, f"Unclustered, n = {cluster_info[-1]}")
## fetch existing handles and labels (each list will have 2 * cluster number entries -> 1 for each boxplot cluster,
## 1 for each scatterplot cluster, so I drop the first half and keep the scatterplot handles)
handles, _ = ax.get_legend_handles_labels()
plt.legend(handles[-cluster_n:], legend_labels)
plt.xticks(rotation=45)
plt.show()
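The legend text depends on a custom `cluster_sizes` mapping that is not shown here. A minimal sketch of how such a mapping and the labels could be built from a plain list of DBSCAN labels follows; `count_cluster_sizes` and `build_legend_labels` are hypothetical helper names of my own, not part of any library.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_cluster_sizes(labels: Iterable[int]) -> Dict[int, int]:
    """Hypothetical helper: map each cluster label to its sample count (-1 = noise)."""
    return dict(Counter(int(label) for label in labels))


def build_legend_labels(cluster_sizes: Dict[int, int]) -> List[str]:
    """Hypothetical helper: noise entry first (if present), then clusters numbered from 1."""
    legend_labels = [
        f"cluster {label + 1}, n = {size}"
        for label, size in sorted(cluster_sizes.items())
        if label != -1
    ]
    if -1 in cluster_sizes:
        legend_labels.insert(0, f"Unclustered, n = {cluster_sizes[-1]}")
    return legend_labels


sizes = count_cluster_sizes([0, 0, 1, -1, 0])
print(build_legend_labels(sizes))
# ['Unclustered, n = 1', 'cluster 1, n = 3', 'cluster 2, n = 1']
```

Putting the noise entry first matches the hue order seaborn would use for sorted numeric labels, where -1 comes before 0.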
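Please note: most of my time was spent debugging the melt function. I mainly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate the table of outcome variable values with the cluster labels (DBSCAN), and I had put an extra pair of square brackets around the cluster array in the concat call. So I ended up with a column in which every value was an invisible List[int] rather than a plain int. This is very niche, but maybe it will help someone.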
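For anyone chasing a similar problem, one way to spot such a column is to test whether any cell holds a list-like object instead of a scalar. This is only a sketch: the `bad` frame is my own guess at what a column of one-element lists looks like, not a reproduction of the exact call that caused the error, and `has_nested_cells` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

labels = np.array([0, 0, 1, -1])

# Expected layout: one plain integer per row
good = pd.DataFrame({"cluster": labels})

# Assumed shape of the slip: every label wrapped in its own one-element list
bad = pd.DataFrame({"cluster": [[label] for label in labels]})


def has_nested_cells(series: pd.Series) -> bool:
    """Hypothetical check: True if any cell is a list, tuple or array rather than a scalar."""
    return bool(series.map(lambda v: isinstance(v, (list, tuple, np.ndarray))).any())


print(has_nested_cells(good["cluster"]))  # False
print(has_nested_cells(bad["cluster"]))   # True

# Unwrap the one-element lists to recover plain scalar labels
fixed = bad["cluster"].map(lambda v: v[0])
print(fixed.tolist())  # [0, 0, 1, -1]
```

The nested values are easy to miss because the printed frame looks almost the same; checking the column's `dtype` (it becomes `object`) is another quick hint.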