python - How to compare clusters?

Question

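Hopefully this can be done in Python! I ran two clustering programs on the same data and now have a cluster file from each. I reformatted the files so that they look like this: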

Cluster 0:
    Brucellaceae(10)
        Brucella(10)
            abortus(1)
            canis(1)
            ceti(1)
            inopinata(1)
            melitensis(1)
            microti(1)
            neotomae(1)
            ovis(1)
            pinnipedialis(1)
            suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

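These files contain bacterial species information. So I have the cluster number (Cluster 0), then right below it the "family" (Brucellaceae) and the number of bacteria in that family (10). Below that are the genera found in that family (name followed by count, Brucella(10)), and finally the species in each genus (abortus(1), etc.).

My question: I have 2 files formatted this way and want to write a program that finds the differences between them. The only problem is that the two programs cluster in different ways, so two clusters may be identical even if the actual "cluster number" differs (the contents of Cluster 1 in one file might match Cluster 43 in the other, the only difference being the cluster number). So I need something that ignores the cluster number and focuses on the cluster contents.

Is there any way to compare these 2 files and check for differences? Is it even possible? Any ideas would be greatly appreciated!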

score 1 · Accepted Answer

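Just to help out, since I see a lot of different answers in the comments, I'll give you a very, very simple implementation of a script that you can start from.

Note that this doesn't answer your full question; it points you in one of the directions suggested in the comments.

In general, if you have no prior experience, I'd recommend reading up on Python first (I would anyway, and I'll add some links at the bottom of the answer).

On to the fun stuff! :)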

class Cluster(object):
  '''
  This is a class that will contain your information about the Clusters.
  '''
  def __init__(self, number):
    '''
    This is what some languages call a constructor, but it's not.
    This method initializes the properties with values from the method call.
    '''
    self.cluster_number = number
    self.family_name = None
    self.bacteria_name = None  # the genus line
    self.bacteria = []         # the species lines

#This part below isn't a part of the class, this is the actual script.
clusters = []
cluster = None
with open('bacteria.txt', 'r') as infile:
  for line in infile:
    line = line.strip()
    if not line:
      continue  # skip blank lines
    if line.startswith('Cluster'):
      # 'Cluster 12:' -> 12 (the number from the file, not the line index)
      cluster = Cluster(int(line[len('Cluster'):].rstrip(':')))
      clusters.append(cluster)
    elif cluster is None:
      continue  # ignore anything before the first cluster header
    elif not cluster.family_name:
      cluster.family_name = line
    elif not cluster.bacteria_name:
      cluster.bacteria_name = line
    else:
      cluster.bacteria.append(line)

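I wrote this to be dumb and overly simple, with nothing fancy, for Python 2.7.2. You can copy it into a .py file and run it directly from the command line with python bacteria.py.

Hope this helps, and feel free to drop by our Python chat room if you have any questions! :)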
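
As one possible next step from the parsing above, here is a minimal sketch of matching clusters by content while ignoring their numbers. Each cluster is reduced to a frozenset of its stripped lines, so two clusters match only when their contents are identical. The names `read_clusters` and `match_clusters` and the sample data are hypothetical, and because lines are stripped, the family/genus/species hierarchy is not preserved.

```python
def read_clusters(lines):
    """Map each cluster number to a frozenset of its stripped content lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('Cluster'):
            current = int(line[len('Cluster'):].rstrip(':'))
            clusters[current] = set()
        elif current is not None:
            clusters[current].add(line)
    return {number: frozenset(content) for number, content in clusters.items()}


def match_clusters(clusters1, clusters2):
    """Pair up cluster numbers whose contents are identical."""
    by_content = {content: number for number, content in clusters2.items()}
    matched = {n: by_content[c] for n, c in clusters1.items() if c in by_content}
    only_in_1 = sorted(n for n in clusters1 if n not in matched)
    only_in_2 = sorted(set(clusters2) - set(matched.values()))
    return matched, only_in_1, only_in_2


# Hypothetical sample data in the question's format.
file1 = '''Cluster 0:
    Brucellaceae(2)
        Brucella(2)
            abortus(1)
            canis(1)
Cluster 1:
    Streptomycetaceae(1)
        Streptomyces(1)
            albaduncus(1)'''.split('\n')
file2 = '''Cluster 43:
    Brucellaceae(2)
        Brucella(2)
            abortus(1)
            canis(1)
Cluster 7:
    Streptomycetaceae(1)
        Streptomyces(1)
            achromogenes(1)'''.split('\n')

matched, only_in_1, only_in_2 = match_clusters(read_clusters(file1),
                                               read_clusters(file2))
print(matched)     # {0: 43}
print(only_in_1)   # [1]
print(only_in_2)   # [7]
```

Exact matching is strict: a single differing species makes two clusters unmatched, so it works best as a first pass before looking at the leftovers more closely.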

score 1 · Accepted Answer

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

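def parse_file(open_file):
    result = []
    levels = ['', '', '']

    for line in open_file:
        # Assumes one space of indentation per level, as in the sample data.
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # A 'Cluster N:' header (or a blank line): start a fresh path
            # and ignore the cluster number itself.
            levels = ['', '', '']
            continue
        item = line.strip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result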

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print(desc)
    print('')
    for item in items:
        print('\t' + item)
    print('')

This prints (the order within each group may vary, since sets are unordered):

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
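
The comparison above pools every item in a file and so ignores which cluster it belongs to. If cluster membership matters, one option is to keep one set of paths per cluster and pair each cluster in the first file with its most similar cluster in the second. The sketch below uses the Jaccard index for similarity; the helper names `clusters_as_sets`, `jaccard`, and `best_matches` are hypothetical, and it assumes the same one-space-per-level indentation as the sample data.

```python
def clusters_as_sets(lines):
    """Return {cluster header: set of 'family.genus.species' paths}."""
    clusters = {}
    levels = ['', '', '']
    current = None
    for line in lines:
        indent = len(line) - len(line.lstrip())
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if indent == 0:
            current = text.rstrip(':')
            clusters[current] = set()
            levels = ['', '', '']
            continue
        levels[indent - 1] = text.split('(', 1)[0]
        if indent == 3:
            clusters[current].add('.'.join(levels))
    return clusters


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 1.0


def best_matches(clusters1, clusters2):
    """For each cluster in the first file, the most similar one in the second."""
    result = {}
    for name1, items1 in clusters1.items():
        name2 = max(clusters2, key=lambda n: jaccard(items1, clusters2[n]))
        result[name1] = (name2, jaccard(items1, clusters2[name2]))
    return result


file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

matches = best_matches(clusters_as_sets(file1), clusters_as_sets(file2))
for name1 in sorted(matches):
    name2, score = matches[name1]
    print('%s -> %s (similarity %.2f)' % (name1, name2, score))
```

With this data, Cluster 0 pairs with Cluster 18 and Cluster 1 with Cluster 19, each with a similarity below 1 because one item differs. Note that this greedy pairing can assign the same second-file cluster to more than one first-file cluster.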

score 1 · Accepted Answer

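You will have to write some code to parse the files. If you ignore the clusters, you should be able to tell family, genus, and species apart by their indentation.

The simplest approach is to define a named tuple: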

import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

You can create an instance of this object like this:

b = Bacterium('Brucellaceae', 'Brucella', 'canis')

Your parser should read the file line by line and set the family and genus. When it finds a species, it should append a Bacterium to a list:

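with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
    name = line.strip().split('(', 1)[0]
    if not name or name.startswith('Cluster'):
        continue  # skip blank lines and cluster headers
    indent = len(line) - len(line.lstrip())
    # Assumes 4 spaces per level: family at 4, genus at 8, species at 12.
    if indent <= 4:
        family = name
    elif indent <= 8:
        genera = name
    else:
        bacteria.append(Bacterium(family, genera, name))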

Once you have a list of all the bacteria in each file or cluster, you can select from it like this:

s = [b for b in bacteria if b.genera == 'Streptomyces']
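
Because named tuples are hashable and compare by value, lists built this way can also be turned into sets and compared directly. The following is a small sketch with hypothetical, hand-written parse results standing in for two real files:

```python
import collections

Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

# Hypothetical parse results for two files.
bacteria1 = [
    Bacterium('Brucellaceae', 'Brucella', 'canis'),
    Bacterium('Brucellaceae', 'Brucella', 'suis'),
    Bacterium('Streptomycetaceae', 'Streptomyces', 'albaduncus'),
]
bacteria2 = [
    Bacterium('Brucellaceae', 'Brucella', 'canis'),
    Bacterium('Streptomycetaceae', 'Streptomyces', 'albaduncus'),
    Bacterium('Streptomycetaceae', 'Streptomyces', 'achromogenes'),
]

# Set difference finds entries present in one file but not the other.
only_in_1 = set(bacteria1) - set(bacteria2)
only_in_2 = set(bacteria2) - set(bacteria1)

# Counter gives a quick per-family tally for one file.
per_family = collections.Counter(b.family for b in bacteria1)

print(sorted(only_in_1))
print(sorted(only_in_2))
print(per_family['Brucellaceae'])  # 2
```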

score 1 · Accepted Answer

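Comparing two clusterings is not a trivial task, and reinventing the wheel is unlikely to succeed. Check out this package, which has many different clustering similarity metrics and can compare dendrograms (the data structure you have).

The library is called CluSim and can be found here: https://github.com/Hoosier-Clusters/clusim/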
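
To give a feel for what a clustering similarity metric measures, here is a minimal pure-Python sketch of one classic metric, the Rand index. It does not use CluSim (whose API is not shown in this answer); the function name `rand_index` and the label lists are purely illustrative.

```python
from itertools import combinations


def rand_index(labels1, labels2):
    """Fraction of item pairs on which the two clusterings agree.

    A pair agrees when both clusterings put the two items together,
    or both keep them apart. The cluster numbers themselves never matter.
    """
    if len(labels1) != len(labels2):
        raise ValueError('both labelings must cover the same items')
    pairs = list(combinations(range(len(labels1)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels1[i] == labels1[j]) == (labels2[i] == labels2[j])
        for i, j in pairs
    )
    return agree / float(len(pairs))


# Same partition under different cluster numbers: perfect agreement.
print(rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # 1.0
# Two very different partitions of four items score much lower.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The pairwise loop is quadratic in the number of items, so for large data sets a dedicated library is the more practical choice.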

score 0 · Accepted Answer

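After learning so much from Stack Overflow, I finally have a chance to give back! A different approach from those offered so far is to relabel the clusters to maximize alignment, after which the comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you would want these two labelings to be treated as equivalent, since L1 and L2 partition the items into the same clusters. This approach relabels L2 to maximize alignment, and in the example above it would result in L2==L1.

I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012". Below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done: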

import numpy as np

def alignClusters(clstr1, clstr2):
    """Given 2 cluster assignments, this function will rename the second to
    maximize alignment of elements within each cluster. This method is
    described in Menéndez, Héctor D. A genetic approach to the graph and
    spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
    are consecutive integers starting with zero)

    INPUTS:
    clstr1 - The first clustering assignment
    clstr2 - The second clustering assignment

    OUTPUTS:
    clstr2_temp - The second clustering assignment with clusters renumbered to
    maximize alignment with the first clustering assignment """
    clstr1 = np.asarray(clstr1)
    clstr2 = np.asarray(clstr2)
    # Use the larger label range so neither labelling indexes out of bounds.
    K = int(max(np.max(clstr1), np.max(clstr2))) + 1
    simdist = np.zeros((K, K))

    # simdist[i, j]: overlap between cluster i of clstr1 and cluster j of
    # clstr2, averaged over the two cluster sizes.
    for i in range(K):
        for j in range(K):
            dcix = clstr1 == i
            dcjx = clstr2 == j
            dd = float(np.dot(dcix.astype(int), dcjx.astype(int)))
            if dd > 0:  # also avoids dividing by zero for empty clusters
                simdist[i, j] = (dd / np.sum(dcix) + dd / np.sum(dcjx)) / 2

    # Greedily pair the most similar clusters, then retire that row and column.
    clstr2_temp = np.copy(clstr2)
    for _ in range(K):
        x, y = np.unravel_index(np.argmax(simdist.T), simdist.shape, order='F')
        clstr2_temp[clstr2 == y] = x
        # -1 is below any real similarity, so a used row or column
        # is never picked again.
        simdist[x, :] = -1
        simdist[:, y] = -1
    return clstr2_temp
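
As a quick way to check the equivalence described above without relabeling anything, a labeling can be converted into the partition it induces, that is, a set of groups of item indices. Two labelings are then equivalent exactly when their partitions are equal. This is a minimal pure-Python sketch; the helper name `as_partition` and the extra list `L3` are illustrative only.

```python
def as_partition(labels):
    """Convert a label list into a set of frozensets of item indices."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return set(frozenset(members) for members in groups.values())


L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]   # same grouping as L1, different numbers
L3 = [0, 0, 0, 1, 2, 2]   # a genuinely different grouping

print(as_partition(L1) == as_partition(L2))  # True
print(as_partition(L1) == as_partition(L3))  # False
```

This only answers whether two labelings are identical as partitions; the alignment approach is still what you want when the clusterings differ and you need to see where.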
