perl - Preprocessing input data for clustering with CLUTO

Question

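I am trying to cluster some words based on their pairwise similarity. Part of my data is shown below (this is just a sample, "animal.txt"; it resembles an adjacency matrix).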

    cat dog horse ostrich 
cat  5    4    3    2
dog  4    5    1    2
horse 3   1    5    4
ostrich 2  2   4    5

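A larger number means a higher similarity between the two words. From data in this format, I want to build clusters (for example, if I ask for 2 clusters, the result would be (cat, dog), (horse, ostrich)).

I tried to use CLUTO to build the clusters.

First, I have to restructure the input file before running CLUTO. So I used doc2mat ( http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat.html ), but I don't know how to use it properly to produce the CLUTO input files (such as the mat and label files). And once the CLUTO input files exist, how do I build clusters from the data above?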

score 0 · Accepted Answer

Since your data is an adjacency matrix, the corresponding CLUTO input file is a so-called GraphFile, not a MatrixFile, so doc2mat is of no help.
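
Before converting, it may be worth confirming that the table really has the shape of an adjacency matrix: the same labels on rows and columns, one value per column, and equal values on both sides of the diagonal. The following is only a sketch of such a check; `check_similarity_table` is a hypothetical helper name, not part of CLUTO, and whether an asymmetric table should be rejected or averaged is left as an assumption for you to decide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CLUTO): returns a list of problems found
# in a labelled similarity table given as text; an empty list means none found.
sub check_similarity_table {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;   # ignore blank lines
    my @cols  = split ' ', shift @lines;           # column headers
    my (@rows, @m, @problems);
    for my $line (@lines) {
        my ($name, @values) = split ' ', $line;    # row label, then numbers
        push @rows, $name;
        push @m, \@values;
    }
    push @problems, "row labels differ from column labels"
        unless "@rows" eq "@cols";
    for my $i (0 .. $#m) {
        if (@{ $m[$i] } != @cols) {
            push @problems, "row $rows[$i] has " . @{ $m[$i] }
                          . " values, expected " . @cols;
            next;
        }
        for my $j (0 .. $i - 1) {
            next unless defined $m[$j][$i];        # row $j was too short
            push @problems, "$rows[$i]/$rows[$j] is not symmetric"
                if $m[$i][$j] != $m[$j][$i];
        }
    }
    return @problems;
}

# The sample table from the question.
my $animal = <<'END';
    cat dog horse ostrich
cat  5    4    3    2
dog  4    5    1    2
horse 3   1    5    4
ostrich 2  2   4    5
END

my @problems = check_similarity_table($animal);
print @problems ? join("\n", @problems) . "\n"
                : "table is square and symmetric\n";
```

The sample from the question passes this check, so it can be converted as it stands.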

This program, txt2graph.pl, converts a file like your sample "animal.txt" into a graph file and a row label file:

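#!/usr/bin/perl
use strict;
use warnings;

my @names = split ' ', <>;      # begin reading txt file, read column headers
(my $GraphFile = $ARGV) =~ s/(\.txt)?$/.graph/;     # animal.txt -> animal.graph
my $LabelFile = "$GraphFile.rlabel";
open my $label, '>', $LabelFile or die "Cannot write $LabelFile: $!";
open my $graph, '>', $GraphFile or die "Cannot write $GraphFile: $!";
print {$graph} scalar(@names), "\n";    # output number of vertices=objects=columns=rows
while (<>)
{                               # process each object row
    next unless /\S/;           # skip blank lines
    chomp;
    my ($name, $numbers) = split ' ', $_, 2;    # split into name, numbers
    print {$label} "$name\n";   # output name
    print {$graph} "$numbers\n";    # output numbers
}
close $label or die "Cannot close $LabelFile: $!";
close $graph or die "Cannot close $GraphFile: $!";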
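
To make the target format concrete, here is a small sketch that applies the same conversion to the sample in memory and prints what the two files would contain. `table_to_graph` is a hypothetical helper written for this illustration; unlike txt2graph.pl it normalizes the spacing between numbers instead of copying each row verbatim.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same idea as txt2graph.pl, but it works on a string and
# returns the graph-file text and the label-file text instead of writing files.
sub table_to_graph {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;   # ignore blank lines
    my @cols  = split ' ', shift @lines;           # column headers
    my $graph = scalar(@cols) . "\n";              # first line: number of vertices
    my $label = '';
    for my $line (@lines) {
        my ($name, @values) = split ' ', $line;
        $label .= "$name\n";                       # one label per row
        $graph .= join(' ', @values) . "\n";       # one row of similarities
    }
    return ($graph, $label);
}

my ($graph, $label) = table_to_graph(<<'END');
    cat dog horse ostrich
cat  5    4    3    2
dog  4    5    1    2
horse 3   1    5    4
ostrich 2  2   4    5
END

print "animal.graph:\n$graph\nanimal.graph.rlabel:\n$label";
```

For the sample, the graph text starts with a line containing 4 (the number of vertices) followed by the four rows of numbers, and the label text lists cat, dog, horse and ostrich, one per line.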

After CLUTO has finished clustering, this program, pclusters.pl, prints the result in the output format you want:

#!/usr/bin/perl
use strict;
use warnings;

@ARGV or die "usage: $0 FILE.clustering.N\n";
(my $LabelFile = $ARGV[0]) =~ s/(\.clustering\.\d+)?$/.rlabel/;
open my $in, '<', $LabelFile or die "Cannot read $LabelFile: $!";
chomp(my @label = <$in>); close $in;            # read labels
my @cluster;
while (<>)
{
    chomp;
    next unless /^\d+$/;                        # skip blank lines and negative cluster numbers
    push @{ $cluster[$_] }, $label[$. - 1];     # add label to its cluster (created on first use)
}
foreach my $cluster (@cluster)
{
    print "(", join(', ', @{ $cluster || [] }), ")\n";  # print a cluster's labels
}

The whole procedure is then:

> txt2graph.pl animal.txt
> scluster animal.graph 2
> pclusters.pl animal.graph.clustering.2
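
If you run this often, the three steps could be chained from a small driver. The sketch below only builds and prints the commands, following the file-name pattern shown above; `pipeline_commands` is a hypothetical helper, and actually running the steps (the commented-out `system` line) assumes that txt2graph.pl, scluster and pclusters.pl are executable and on your PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds the three commands for a given input file and
# cluster count, using the same file-name pattern as the steps above.
sub pipeline_commands {
    my ($txt, $k) = @_;
    (my $graph = $txt) =~ s/(\.txt)?$/.graph/;     # animal.txt -> animal.graph
    return (
        [ 'txt2graph.pl', $txt ],
        [ 'scluster', $graph, $k ],
        [ 'pclusters.pl', "$graph.clustering.$k" ],
    );
}

# Defaults reproduce the example; pass a file name and a cluster count to override.
my ($txt, $k) = @ARGV ? @ARGV : ('animal.txt', 2);
for my $cmd (pipeline_commands($txt, $k)) {
    print "> @$cmd\n";
    # Uncomment to actually run each step and stop at the first failure:
    # system(@$cmd) == 0 or die "@$cmd failed: $?";
}
```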
