scikit-learn - Scikit HDBSCAN tree labels (not single-slice labels)

Question

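BLUF: For a particular epsilon (or HDBSCAN's "favorite" epsilon), I can extract the mapping of my data onto the partition at that epsilon. But how can I see the full tree membership of my data?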

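I've gotten a lot out of the excellent tutorial here. With scikit-learn's HDBSCAN, I can use clusterer.labels_ to see the partition labels at the best epsilon. I can use clusterer.single_linkage_tree_.get_clusters(0.023, min_cluster_size=2) to see the partition labels at an arbitrary epsilon. I can even plot the tree with clusterer.condensed_tree_.plot(). But how do I see the dendrogram labels for individual data points?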

For example: it's nice to know that my pets' names are {Spot, Felix, Nemo, Fido, Tigger}, or that their species are {Dog, Cat, Guppy, Dog, Cat}. But I want an output that tells me:


Spot	Dog	Mammal	Animal
Felix	Cat	Mammal	Animal
Nemo	Guppy	Fish	Animal
Fido	Dog	Mammal	Animal
Tigger	Cat	Mammal	Animal

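With this kind of output I can see exactly how Spot and Felix are related, rather than only answering "Do they have the same species? Yes/No" or "Do they have the same kingdom? Yes/No".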

score 2 · Accepted Answer

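The clusterer.condensed_tree_ object has a number of conversion utilities, such as to_pandas() and to_networkx(). For this particular use case, it sounds like you want to print a list of ancestors for each leaf node in the condensed tree. You can do this in several ways, but one very simple approach is to convert the tree to a networkx graph and use its utility methods to extract the structure you're looking for: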

import hdbscan
import networkx as nx
import numpy as np

# run HDBSCAN
data = np.load('clusterable_data.npy')
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

# convert tree to networkx graph
tree = clusterer.condensed_tree_.to_networkx()
assert nx.algorithms.tree.recognition.is_tree(tree)

# find the root by picking an arbitrary node and walking up
root = 0
while True:
    try:
        root = next(tree.predecessors(root))
    except StopIteration:
        break

# create the ancestor list for each data point
all_ancestors = []
for leaf_node in range(len(data)):
    ancestors = nx.shortest_path(tree, source=root, target=leaf_node)[::-1]
    all_ancestors.append(ancestors)

Printing all_ancestors will give you something like:

[[0, 2324, 2319, 2317, 2312, 2311, 2309],
 [1, 2319, 2317, 2312, 2311, 2309],
 [2, 2319, 2317, 2312, 2311, 2309],
 [3, 2333, 2324, 2319, 2317, 2312, 2311, 2309],
 [4, 2324, 2319, 2317, 2312, 2311, 2309],
 [5, 2334, 2332, 2324, 2319, 2317, 2312, 2311, 2309],
 ...
 [995, 2309],
 [996, 2318, 2317, 2312, 2311, 2309],
 [997, 2318, 2317, 2312, 2311, 2309],
 [998, 2318, 2317, 2312, 2311, 2309],
 [999, 2318, 2317, 2312, 2311, 2309],
 ...]

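The first entry in each list is the node ID (corresponding to the index of that point in the data array), the second entry is that node's parent, and so on up to the root node (which in this case has ID 2309). Note that any node ID greater than or equal to the number of data items you have is a "cluster node" (i.e. an internal node of the tree), and any lower node ID is a "data point node" (i.e. a leaf of the tree).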
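As a small illustration of that numbering convention, here is a sketch that splits one ancestor list into its data point and its chain of cluster nodes. The helper name split_path is made up for this example, and the sample path and the point count of 2309 are read off the printed output above rather than computed.

```python
def split_path(path, n_points):
    """Split a leaf-first ancestor list into (point_id, cluster_ids).

    Hypothetical helper: relies on the convention that IDs below
    n_points are data points and IDs from n_points upward are clusters.
    """
    point_id, cluster_ids = path[0], path[1:]
    if point_id >= n_points or any(c < n_points for c in cluster_ids):
        raise ValueError(f"unexpected node IDs in path: {path}")
    return point_id, cluster_ids


# Values copied from the printed output above, where the root has ID 2309.
n_points = 2309
point_id, cluster_ids = split_path(
    [3, 2333, 2324, 2319, 2317, 2312, 2311, 2309], n_points
)
print(point_id, cluster_ids)
```

The cluster_ids list runs from the most specific cluster to the root, which is the same order as the Dog / Mammal / Animal columns in the question.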

This list format may be easier to understand if you sort the nodes into their clusters, for example:

all_ancestors.sort(key=lambda path: path[1:])

Printing all_ancestors now will give you something like:

[[21, 2309],
 [126, 2309],
 [152, 2309],
 [155, 2309],
 [156, 2309],
 [172, 2309],
 ...
 [1912, 2313, 2311, 2309],
 [1982, 2313, 2311, 2309],
 [2014, 2313, 2311, 2309],
 [2028, 2313, 2311, 2309],
 [2071, 2313, 2311, 2309],
 ...
 [1577, 2337, 2314, 2310, 2309],
 [1585, 2337, 2314, 2310, 2309],
 [1591, 2337, 2314, 2310, 2309],
 [1910, 2337, 2314, 2310, 2309],
 [2188, 2337, 2314, 2310, 2309]]
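Since the question is about how closely two points are related, one possible follow-up is to compare two ancestor lists and report the deepest cluster they share. This is only a sketch: deepest_shared_cluster is a name invented here, and the two sample paths are copied from the unsorted output earlier in this answer.

```python
def deepest_shared_cluster(path_a, path_b):
    """Return the deepest cluster common to two leaf-first ancestor lists."""
    shared = None
    # Compare the cluster chains from the root end; they agree until
    # the two branches of the tree split apart.
    for a, b in zip(reversed(path_a[1:]), reversed(path_b[1:])):
        if a != b:
            break
        shared = a
    return shared


path_0 = [0, 2324, 2319, 2317, 2312, 2311, 2309]
path_996 = [996, 2318, 2317, 2312, 2311, 2309]
print(deepest_shared_cluster(path_0, path_996))  # prints 2317
```

Two points whose only shared cluster is the root are related only at the top level, much like Nemo and Spot sharing nothing below "Animal".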

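There are many equivalent ways to get the same result (for example, by looping over the pandas data frame produced by to_pandas()), but networkx is a natural choice for almost anything you might want to do with a tree/DAG/graph.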
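For comparison, here is a rough sketch of the non-networkx route using a plain child-to-parent dictionary. It runs on a tiny hand-made edge list so that it is self-contained; the function name ancestors_from_edges and the toy tree are illustrative only. With a real clusterer the (parent, child) pairs would presumably come from the parent and child columns of the to_pandas() data frame, which is an assumption worth checking against your installed version.

```python
def ancestors_from_edges(edges, n_points):
    """Build leaf-first ancestor lists from (parent, child) pairs."""
    # Each node has at most one parent in a tree, so a dict is enough.
    parent_of = {child: parent for parent, child in edges}
    all_ancestors = []
    for point in range(n_points):
        path = [point]
        # Follow parent links until a node without a parent (the root).
        while path[-1] in parent_of:
            path.append(parent_of[path[-1]])
        all_ancestors.append(path)
    return all_ancestors


# Toy tree: points 0-3, root cluster 4, child clusters 5 and 6.
edges = [(4, 5), (4, 6), (5, 0), (5, 1), (6, 2), (4, 3)]
paths = ancestors_from_edges(edges, n_points=4)
print(paths)  # [[0, 5, 4], [1, 5, 4], [2, 6, 4], [3, 4]]
```

This avoids a shortest-path search per point, at the cost of giving up the other graph utilities that networkx offers.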
