python - Alternative way to find a dendrogram in Python

Question

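I have data of size 8000x100 and need to cluster the 8000 items. I am mainly interested in the ordering of these items. For small data the code below gives me the result I want, but for higher dimensions I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there another way to get the reordered columns from 'Z'?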

from hcluster import pdist, linkage, dendrogram
import numpy
from numpy.random import rand

x = rand(8,100) # rand(8000,100) gives runtime error
Y = pdist(x)
Z = linkage(Y)
reorderedCol = dendrogram(Z)['ivl']


Traceback: 

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> 

>>> x = rand(8000,100)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> reorderedCol = dendrogram(Z)['ivl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2062, in dendrogram
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info

...
...

  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2311, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2209, in _dendrogram_calculate_info
    _append_singleton_leaf_node(Z, p, n, level, lvs, ivl, leaf_label_func, i, labels)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2091, in _append_singleton_leaf_node
    ivl.append(str(int(i)))
RuntimeError: maximum recursion depth exceeded while getting the str of an object
>>>

score 2 · Accepted Answer

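The problem is that a dendrogram is a visualization technique. With 8000 objects it is already close to unreadable, which is probably why the function was never optimized for inputs of this size.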

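For larger data sets, I recommend moving away from hierarchical clustering of any kind (it runs in O(n^3) when implemented with matrix operations, although in some cases it can be done in O(n^2)) and using, for example, OPTICS (Wikipedia) instead. Do not use the OPTICS in Weka or the Python version that is floating around, because both are incomplete!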
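As an illustration of how OPTICS produces an ordering rather than a tree, here is a minimal sketch using scikit-learn's `OPTICS` class. That class is an assumption of this example: the answer does not name it, so whether its caution about incomplete implementations applies here is not something the source settles. `min_samples=5` and the array size are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
x = rng.random((200, 100))             # small stand-in for the 8000x100 data

optics = OPTICS(min_samples=5).fit(x)
order = optics.ordering_               # cluster-ordered sample indices
reach = optics.reachability_[order]    # reachability values in that order

print(order[:10])
```

Plotting `reach` as a bar chart gives the usual reachability plot, where valleys correspond to clusters.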

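I could not even run dendrogram; I got the error "matplotlib not available. Plot request denied". So it may really be trying to visualize the dendrogram! If it puts a lot of effort into optimizing the visualization, it could well run out of memory. As I showed you in your other question, if you compute the ordering of the dendrogram leaves yourself, you should be able to avoid this extra cost.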
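The leaf order can be read directly from the linkage matrix with an explicit stack instead of recursion, so the depth of the tree no longer matters. The sketch below assumes the usual linkage layout: row i merges the clusters named in its first two columns into a new cluster with id n + i, and ids below n are original observations. `leaf_order` is a name chosen here for illustration; it is not part of hcluster.

```python
def leaf_order(Z):
    """Return leaf ids in left-to-right order for a linkage matrix Z.

    Z is any sequence of rows (left_id, right_id, distance, count), as
    returned by linkage(). An explicit stack replaces recursion, so deep
    trees do not hit Python's recursion limit.
    """
    n = len(Z) + 1            # number of original observations
    order = []
    stack = [2 * n - 2]       # id of the root cluster (the last merge)
    while stack:
        node = int(stack.pop())
        if node < n:
            order.append(node)            # leaf: an original observation
        else:
            left, right = Z[node - n][0], Z[node - n][1]
            stack.append(right)           # pushed first, visited second
            stack.append(left)            # pushed last, visited first
    return order


# Tiny hand-written linkage for 4 observations:
# cluster 4 = (2, 3), cluster 5 = (0, 4), cluster 6 = (1, 5)
Z_small = [[2, 3, 0.1, 2], [0, 4, 0.2, 3], [1, 5, 0.3, 4]]
print(leaf_order(Z_small))    # [1, 0, 2, 3]
```

With the variables from the question this would be `reorderedCol = leaf_order(Z)`. The ids come back as integers, whereas `dendrogram(Z)['ivl']` returns them as strings, and the order should match only when dendrogram is called with its default sorting options.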

Is there a reason you are using hcluster rather than scipy.cluster.hierarchy?
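For reference, the same pipeline written against scipy.cluster.hierarchy might look like the sketch below. Its `leaves_list` function returns the left-to-right leaf order straight from the linkage matrix without building any plot data. The array is kept small only so the example runs quickly; this is not a claim about how the 8000-row case behaves.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.random((50, 100))      # 50 items, 100 features (small on purpose)

Y = pdist(x)                   # condensed pairwise distance matrix
Z = linkage(Y)                 # single linkage by default
order = leaves_list(Z)         # integer leaf ids, left to right

print(order[:10])
```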

score 0 · Accepted Answer

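"but for higher dimensions I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object""

Some form of dimensionality reduction (such as PCA or t-SNE) may help with the memory problem.

Reduce from 100 dimensions to about 20.

t-SNE takes time to run, so you can use PCA (which is faster) to go from 100 dimensions to, say, 50, and then use t-SNE to get down to about 10.

Beware: these methods lose information, but they may be enough to get the job done.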
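The PCA step can be sketched with plain NumPy, assuming a centered SVD is an acceptable stand-in for a library PCA. The target of 20 dimensions follows the figure given in this answer, `pca_reduce` is a hypothetical helper name, and the t-SNE step is left out.

```python
import numpy as np


def pca_reduce(x, k):
    """Project the rows of x onto their first k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


rng = np.random.default_rng(0)
x = rng.random((500, 100))     # small stand-in for the 8000x100 data
x20 = pca_reduce(x, 20)
print(x20.shape)               # (500, 20): the number of rows is unchanged
```

The reduced array can then be passed to pdist and linkage as before. Note that this shrinks only the number of columns; the number of items being clustered, and therefore the size of the tree, stays the same.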
