python - Clustering a list of tuples in Python by two factors at once

Question

Hi everyone, I have the following code:

from math import sqrt
array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13), (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]


def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst])/n
##    mean2 = sum([pair[2] for pair in lst])/n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
##    stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - (mean2 * mean2)) 

    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
###### add also the distance between lengths
        mean,stdev = stat(cluster)
        if (abs(mean - i[0]) > n * stdev):   # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list

        cluster.append(i)
    yield cluster           # yield the last cluster

for cluster in parse(array, 7):
    print(cluster)

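What it does is cluster my list of tuples (array) by looking at the variable i[0]. What I would also like to do is cluster it further by the variable i[2] of each tuple.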

The current output is:

[(1, 'a', 10), (2, 'a', 11), (3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13), (80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]

I would like something like this:

[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]

So the values of i[0] are close together, and the values of i[2] are close together as well. Any ideas how to crack this?

score 0 · Accepted Answer

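You can run your parse method a second time on the results of the first run. That gives you something not exactly the same as what you want, but very similar: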

from math import sqrt

def stat(lst, index):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[index] for pair in lst])/n
    stdev = sqrt((sum(x[index]*x[index] for x in lst) / n) - (mean * mean))
    return mean, stdev

def parse(lst, n, index):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
        mean, stdev = stat(cluster, index)
        if (abs(mean - i[index]) > n * stdev):   # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list

        cluster.append(i)
    yield cluster           # yield the last cluster

for cluster in parse(array, 7, 0):
    for nc in parse(cluster, 3, 2):
        print(nc)

This prints:

[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306)]
[(250, 'g', 302)]
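
One caveat with the generator above: it resets the cluster in place with cluster[:] = [], so every yielded value is the same list object. Printing each cluster right away works, but if you collect the results (for example with list(parse(array, 7, 0))) you end up with several references to one list. Below is a minimal sketch of a variant that starts a fresh list after each yield, so the two-pass result can be stored. The name parse_fresh is made up for this example, the max(..., 0.0) guard is my addition, and unlike the original it yields nothing for an empty input:

```python
from math import sqrt

array = [(1, 'a', 10), (2, 'a', 11), (3, 'c', 200), (60, 'a', 12),
         (70, 't', 13), (80, 'g', 300), (100, 'a', 305), (220, 'c', 307),
         (230, 't', 306), (250, 'g', 302)]


def stat(lst, index):
    """Mean and population std deviation of element `index` of each tuple."""
    n = float(len(lst))
    mean = sum(item[index] for item in lst) / n
    # Clamp at 0.0 so a tiny negative rounding error cannot break sqrt.
    variance = max(sum(item[index] ** 2 for item in lst) / n - mean * mean, 0.0)
    return mean, sqrt(variance)


def parse_fresh(lst, n, index):
    """Same splitting rule as parse(), but each yielded cluster is a new list."""
    cluster = []
    for item in lst:
        if len(cluster) >= 2:  # the first two values go in unconditionally
            mean, stdev = stat(cluster, index)
            if abs(mean - item[index]) > n * stdev:
                yield cluster
                cluster = []   # rebind to a new list instead of clearing in place
        cluster.append(item)
    if cluster:
        yield cluster


# Cluster on i[0] first, then split each cluster again on i[2].
result = [inner
          for outer in parse_fresh(array, 7, 0)
          for inner in parse_fresh(outer, 3, 2)]
for cluster in result:
    print(cluster)
```

Because every cluster is its own list, result keeps the six clusters shown above instead of repeated references to a single list.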

score 0 · Accepted Answer

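First of all, the way you compute the variance is numerically unstable. E(X^2)-E(X)^2 holds mathematically, but it kills numerical precision. In the worst case you get a negative value, and sqrt then fails.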
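
To make the precision problem concrete, here is a small sketch comparing the E(X^2)-E(X)^2 shortcut with a two-pass computation that subtracts the mean first. The function names and the sample data (small spread on top of a large offset) are my own choices for illustration, not from the question:

```python
def naive_variance(values):
    """Population variance via E(X^2) - E(X)^2, as in the question."""
    n = float(len(values))
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean * mean


def two_pass_variance(values):
    """Population variance computed from deviations around the mean."""
    n = float(len(values))
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Deviations from the mean are -6, -3, 3, 6, so the true variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("naive:   ", naive_variance(data))
print("two-pass:", two_pass_variance(data))
```

The squares here are around 1e18, where adjacent doubles are more than 100 apart, so the naive formula cannot represent a difference of 22.5 at all. The two-pass version works on the small deviations and stays exact for this data.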

You should really look at numpy, which can compute this correctly for you.
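
As a rough sketch of what that could look like, the stat helper can hand the arithmetic to numpy. The name stat_np is hypothetical. I am assuming the population standard deviation (numpy's default, ddof=0) is what is wanted, since that matches the formula in the question:

```python
import numpy as np


def stat_np(lst, index):
    """Mean and population std deviation of element `index`, via numpy."""
    values = np.array([item[index] for item in lst], dtype=float)
    # ndarray.std() defaults to ddof=0, i.e. it divides by n.
    return float(values.mean()), float(values.std())


cluster = [(60, 'a', 12), (70, 't', 13), (80, 'g', 300)]
mean, stdev = stat_np(cluster, 0)
print(mean, stdev)
```

stat_np takes the same arguments as stat(lst, index), so it could be swapped into parse without other changes.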

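Conceptually, have you considered treating your data as a two-dimensional data space? You could then whiten it and run e.g. k-means or any other vector-based clustering algorithm.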
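
A minimal sketch of the first half of that idea, assuming "whitening" in its simplest form: treat (i[0], i[2]) as a 2-D point and divide each column by its standard deviation, so that both attributes contribute on a comparable scale. The helper name whiten is mine, and the clustering step itself is left out:

```python
import numpy as np

array = [(1, 'a', 10), (2, 'a', 11), (3, 'c', 200), (60, 'a', 12),
         (70, 't', 13), (80, 'g', 300), (100, 'a', 305), (220, 'c', 307),
         (230, 't', 306), (250, 'g', 302)]


def whiten(points):
    """Scale each column to unit standard deviation (a simple whitening)."""
    pts = np.asarray(points, dtype=float)
    std = pts.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns untouched
    return pts / std


# Drop the letter and keep the two numeric attributes as 2-D points.
points = [(t[0], t[2]) for t in array]
whitened = whiten(points)
print(whitened)
```

The rows of whitened can then be passed to whichever vector-based clustering routine you prefer. If SciPy is available, scipy.cluster.vq provides a whiten function of this kind along with k-means implementations, though I have not tried it on this data.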

Standard deviation and mean are trivial to generalize to multiple attributes (look up "Mahalanobis distance").
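
For the two-attribute case, the Mahalanobis distance can be written out by hand. The following is only a sketch: the function name is made up, it uses the population covariance, and it simply raises an error when the covariance matrix is singular (for example with only two points, or points on a straight line), which a real clustering loop would have to handle somehow:

```python
from math import sqrt


def mahalanobis_2d(points, x):
    """Mahalanobis distance of 2-D point x from the distribution of points."""
    n = float(len(points))
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 population covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mx, x[1] - my
    # d^T * inverse(covariance) * d, with the 2x2 inverse written out.
    return sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)


# (i[0], i[2]) pairs of one cluster from the question, and a candidate point.
cluster_points = [(60, 12), (70, 13), (80, 300), (100, 305)]
print(mahalanobis_2d(cluster_points, (220, 307)))
```

This distance plays the role of abs(mean - i[0]) / stdev in the original check, but it takes both attributes and their correlation into account at once.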
