perl - Perl script to convert a collection of texts into a vector representation

Question

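The input file is UTF-8 encoded, and every line has the following structure:

    C\tTEXT\n

where C is the class of the document (a few characters), \t is a tab, TEXT is a sequence of characters, and \n is a newline.

From each TEXT, remove HTML tags and similar markup, entities, and all characters that are not letters, so that every text becomes a sequence of words. The order of the words does not matter.

Then create a vector from each TEXT. The elements (attributes) of the vector correspond to the words of the whole text collection, and each value depends on the occurrence of that word in the TEXT. The values can be of two types:

A - presence of the word (1 or 0)
B - number of occurrences of the word (0 or more)

The last value of the vector is the class of the document.

If necessary, words with a low total frequency across all texts (for example, one occurrence) can be removed.

Words with a small number of characters can also be removed.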

Example input file:
CLASS    One Class One
CLASS    One Two
2CLASS   two three
CLAS12   three

Example output file:

These are the script parameters (minimum word length = 1, minimum number of word occurrences = 1, A)

Output:

      one two three
CLASS  2   0    0 
CLASS  1   1    0
2CLASS 0   1    1
CLAS12 0   0    1

My current code:

Please help me.

#!/usr/bin/perl

use strict;
use encoding 'UTF-8';
use Data::Dumper;

my %vector = ();
my @vectors = ();
my ($string,$word);

open SOURCE, "<:encoding(UTF-8)", "source.txt" or die "File does not exist $!\n";

my($class,$hodnota);
while (my $line = <SOURCE>) {
  if($line=~ /^(\w+)\t(.+)\n/){  
    $string =$2; $class = $1;
    $string=~ s/[^a-zA-Z ]//g; 

      for $word ( split " +", $string )
      {
        $vector{$word}++;
      }

      $vector{"class"} = $class;
      push(@vectors, %vector)
   }

}          
    close S;

print Dumper( \@vectors );

score 1 · Accepted Answer

use strict; 
use warnings;
use Data::Dumper;

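my $file = shift(@ARGV) or die "Usage: $0 INPUT_FILE\n";
open my $in_data, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!\n";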
my @array_of_hashes_of_hashes=(); 
#used array of hashes_of_hashes because you treated two instances of CLASS differently
#if they could be treated the same, a simple hash of hashes would work fine.

while (<$in_data>)
{  
    if ($_ =~ /^(\w+)\t(.+)\n/)
    {   
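        my ($class, $text) = ($1, $2); #copy the captures before anything else can reset them
        my %temp_hash=();
        my @values=split(' ',$text); #' ' splits on runs of whitespace, so no empty words

        foreach (@values)
        {
            $temp_hash{lc($_)}+=1; #so that one and One map to the same key
        }

        push @array_of_hashes_of_hashes, {$class => \%temp_hash};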
    }
}

print Dumper \@array_of_hashes_of_hashes; #just to show you what it looks like

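I noticed that your example output does not print a value for Class from CLASS One Class One, so you may want to filter that word out when you print everything.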
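To get from that array of hashes to the table in the question, something like the following could work. It is only a sketch: the sample data is hypothetical (hard-coded in the shape the script above builds), vectors_as_lines is a made-up helper name, and the LABEL header for the class column is an assumption chosen because lowercased words cannot collide with it.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape the script above builds:
# one { class => { word => count } } hash per input line.
my @rows = (
    { CLASS    => { one => 2, class => 1 } },
    { CLASS    => { one => 1, two => 1 } },
    { '2CLASS' => { two => 1, three => 1 } },
    { CLAS12   => { three => 1 } },
);

# Build a header line plus one tab-separated line per document.
# A true $binary gives type A (1 or 0), a false one gives type B (counts).
sub vectors_as_lines {
    my ($rows, $binary) = @_;

    # Collect the vocabulary of the whole collection.
    my %seen;
    for my $row (@$rows) {
        my ($counts) = values %$row;
        $seen{$_} = 1 for keys %$counts;
    }
    my @vocab = sort keys %seen;

    # The class goes last, as the question asks.
    my @lines = (join("\t", @vocab, 'LABEL'));
    for my $row (@$rows) {
        my ($class, $counts) = %$row;    # each hash holds exactly one pair
        my @cells = map {
            my $n = $counts->{$_} || 0;
            $binary ? ($n ? 1 : 0) : $n;
        } @vocab;
        push @lines, join("\t", @cells, $class);
    }
    return @lines;
}

print "$_\n" for vectors_as_lines(\@rows, 0);
```

Passing 1 instead of 0 as the second argument switches from counts (type B) to presence flags (type A).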

score 1 · Accepted Answer

I would suggest the following:

chomp($line);
if ($line =~ /^(\w+)\t(.+)/){
    my $vector = {};
    my ($class, $string) = ($1, $2);
    for my $word (split /[^a-zA-Z]+/, $string) {
        next if length($word) < $some_threshold; # $word is too short (this also skips empty fields when the threshold is at least 1)
        my $word_lc = lc($word);
        $vector->{$word_lc}++;
        $all_words->{$word_lc} = 1; # $all_words has to be initialized before the main loop, as my $all_words = {};
    }
    $vector->{"class"} = $class; # hopefully, no words will be "class"
    push(@vectors, $vector); # push the reference so that every line keeps its own hash
}

Once that is done, all the words that were used can be found with keys %$all_words. I hope I understood your needs correctly.
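The question also asks for markup and entities to be stripped and for rare words to be dropped, which the snippet above does not cover. A rough sketch of those two steps might look like this. The helper names words_from_text and frequent_words are made up for illustration, the tag and entity patterns are simple assumptions rather than a real HTML parser, and the class is stored under its own key (separate from the word counts) so that a text containing the word "class" cannot overwrite it.

```perl
use strict;
use warnings;

# Reduce a raw TEXT field to a list of lowercase words.
# The two substitutions are rough assumptions, not an HTML parser.
sub words_from_text {
    my ($text, $min_length) = @_;
    $text =~ s/<[^>]*>/ /g;      # drop tags such as <b> or </p>
    $text =~ s/&#?\w+;/ /g;      # drop entities such as &amp; or &#160;
    return grep { length($_) > 0 && length($_) >= $min_length }
           map  { lc($_) }
           split /[^a-zA-Z]+/, $text;
}

# Keep only words whose total count over all texts is at least $min_count.
sub frequent_words {
    my ($vectors, $min_count) = @_;
    my %total;
    for my $vector (@$vectors) {
        $total{$_} += $vector->{words}{$_} for keys %{ $vector->{words} };
    }
    my @kept = grep { $total{$_} >= $min_count } keys %total;
    @kept = sort @kept;
    return @kept;
}

# Hypothetical input lines, based on the example in the question.
my @lines = (
    "CLASS\tOne <b>Class</b> One",
    "CLASS\tOne &amp; Two",
    "2CLASS\ttwo three",
    "CLAS12\tthree",
);
my ($min_length, $min_count) = (1, 2);

my @vectors;
for my $line (@lines) {
    next unless $line =~ /^(\w+)\t(.+)/;
    my ($class, $text) = ($1, $2);
    my %words;
    $words{$_}++ for words_from_text($text, $min_length);
    push @vectors, { class => $class, words => \%words };
}

# Header with the surviving words, then one count vector (type B) per text.
my @vocab = frequent_words(\@vectors, $min_count);
print join("\t", @vocab), "\n";
for my $vector (@vectors) {
    print join("\t", (map { $vector->{words}{$_} || 0 } @vocab), $vector->{class}), "\n";
}
```

With a minimum count of 2, the word "class" (seen only once in this sample) drops out of the vocabulary, while "one", "three" and "two" remain.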
