
The input file is UTF-8 encoded, and each line has the following structure:

    C\tTEXT\n

where C is the document's class (a few characters), \t is a tab, TEXT is a sequence of characters, and \n is a newline.
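As a minimal sketch of reading that line format, the following splits a line at the first tab only, so any further tabs stay inside TEXT. The name `parse_line` is hypothetical, and skipping malformed lines is an assumption about what is wanted here.

```perl
use strict;
use warnings;

# Split one input line into (class, text).
# Returns an empty list when the line has no tab or no class (assumed policy).
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # drop the trailing newline
    my ($class, $text) = split /\t/, $line, 2;   # split at the first tab only
    return unless defined $text && length $class;
    return ($class, $text);
}

my ($class, $text) = parse_line("CLASS\tOne Class One\n");
print "$class => $text\n";
```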

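From each TEXT, HTML tags and similar markup, entities, and non-letter characters are removed, and each text is converted into a sequence of words in which order does not matter.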
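The cleaning step could be sketched roughly as below. The function name `tokenize` is hypothetical, the tag removal is a naive regex rather than a real HTML parser, and restricting words to ASCII letters follows the `[a-zA-Z]` class used in the code further down; for non-English UTF-8 text, `\p{L}` would probably be the better character class.

```perl
use strict;
use warnings;

# Turn raw TEXT into a list of lowercase words.
sub tokenize {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;       # naive tag removal (assumption: no '>' inside attributes)
    $text =~ s/&#?\w+;/ /g;       # named and numeric entities such as &amp; or &#160;
    $text =~ s/[^a-zA-Z]+/ /g;    # every run of non-letters becomes one space
    return split ' ', lc $text;   # split ' ' also skips leading whitespace
}

print join(',', tokenize('<b>One</b> &amp; two, 3 three!')), "\n";
```

Lowercasing here makes `One` and `one` count as the same word, which the expected output seems to require.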

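A vector is then created from each TEXT. Its elements (attributes) correspond to the words in the whole collection of texts, and the values depend on how each word occurs in that TEXT. The values can be of two types: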

A - whether the word occurs (1 or 0)
B - number of occurrences of the word (0 or more)
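The difference between the two value types can be illustrated with a small sketch. `make_vector` is a hypothetical helper; it stores only the words that occur, so a missing key stands for 0.

```perl
use strict;
use warnings;

# Build a word => value hash for one document.
# Type 'A' records presence (1); any other type is treated as 'B' (occurrence count).
sub make_vector {
    my ($type, @words) = @_;
    my %vector;
    for my $word (@words) {
        if ($type eq 'A') { $vector{$word} = 1 }
        else              { $vector{$word}++ }
    }
    return \%vector;
}

my $binary = make_vector('A', qw(one class one));
my $counts = make_vector('B', qw(one class one));
print "A: one=$binary->{one}  B: one=$counts->{one}\n";
```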

The last value of the vector is the document's class.

If necessary, words whose total frequency across all texts is low (for example, one) can be removed.

Words with only a few characters can also be removed.
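Both filters have to be applied after all lines have been read, because the total frequency of a word is only known then. A possible sketch, with `vocabulary` as a hypothetical helper and the thresholds treated as inclusive minimums (an assumption):

```perl
use strict;
use warnings;

# Return the sorted list of words that are long enough and frequent enough.
# $vectors is an array ref of word => count hashes, one per document.
sub vocabulary {
    my ($vectors, $min_length, $min_count) = @_;
    my %total;
    for my $vector (@$vectors) {
        $total{$_} += $vector->{$_} for keys %$vector;   # sum counts over all documents
    }
    my @kept = grep { length($_) >= $min_length && $total{$_} >= $min_count } keys %total;
    return sort @kept;
}

my @words = vocabulary([ { one => 2, a => 1 }, { one => 1, two => 1 } ], 2, 2);
print "@words\n";
```

If the vectors hold type A values, the totals count documents containing the word rather than occurrences.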

Example input file:
CLASS    One Class One
CLASS    One Two
2CLASS   two three
CLAS12   three

Example output file:

These are the script's parameters (minimum word length = 1, minimum number of word occurrences = 1, A)

Output:

      one two three
CLASS  2   0    0 
CLASS  1   1    0
2CLASS 0   1    1
CLAS12 0   0    1

My current code:

Please help me.

#!/usr/bin/perl

use strict;
use encoding 'UTF-8';
use Data::Dumper;

my %vector = ();
my @vectors = ();
my ($string,$word);

open SOURCE, "<:encoding(UTF-8)", "source.txt" or die "File does not exist $!\n";

my($class,$hodnota);
while (my $line = <SOURCE>) {
  if($line=~ /^(\w+)\t(.+)\n/){  
    $string =$2; $class = $1;
    $string=~ s/[^a-zA-Z ]//g; 

      for $word ( split " +", $string )
      {
        $vector{$word}++;
      }

      $vector{"class"} = $class;
      push(@vectors, %vector)
   }

}          
    close S;

print Dumper( \@vectors );

2 Answers

use strict; 
use warnings;
use Data::Dumper;

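# three-argument open with an explicit UTF-8 layer and an error check
open my $in_data, '<:encoding(UTF-8)', shift(@ARGV) or die "Cannot open input file: $!\n";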
my @array_of_hashes_of_hashes=(); 
#used array of hashes_of_hashes because you treated two instances of CLASS differently
#if they could be treated the same, a simple hash of hashes would work fine.

while (<$in_data>)
{  
    if ($_ =~ /^(\w+)\t(.+)\n/)
    {   
        my %temp_hash=();
        my @values=split(/ /,$2);

        foreach (@values)
        {
            $temp_hash{lc($_)}+=1; #so that one and One map to the same key
        }

        push @array_of_hashes_of_hashes, {$1 => \%temp_hash};
    }
}

print Dumper \@array_of_hashes_of_hashes; #just to show you what it looks like

I noticed that you did not print a value for Class from the line CLASS One Class One, so filter it out when you print everything, if that is what you want.

answered 2013-05-22T20:20:09.540

I would suggest the following:

chomp($line);
if ($line =~ /^(\w+)\t(.+)/){
    my %vector;   # a fresh hash for every line
    my ($class, $string) = ($1, $2);
    for my $word (split /[^a-zA-Z]+/, $string) {
        next if length($word) < $some_treshold; # $word is too short (also skips empty fields when the threshold is >= 1)
        my $word_lc = lc($word);
        $vector{$word_lc}++;
        $all_words->{$word_lc} = 1; # this has to be initialized before main loop, as my $all_words = {};
    }
    $vector{"class"} = $class; # hopefully, no words will be "class"
    push(@vectors, \%vector);  # push a reference; pushing %vector would flatten it into a key/value list
}

When you are done, all the words that were used can be found with keys %$all_words. I hope I understood your needs correctly.
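Once the per-line hashes and the word list exist, the table from the question could be printed along these lines. This is only a sketch: `format_matrix` is a hypothetical helper, the tab-separated layout is an assumption, and it keeps the class next to the hash as a `[class, hash]` pair instead of under a `"class"` key, so that a real word "class" (as in "One Class One") cannot collide with it. The `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Render rows of [class, {word => value}] as a tab-separated table,
# one column per word in @words, with 0 for words a document lacks.
sub format_matrix {
    my ($rows, @words) = @_;
    my @lines = (join "\t", '', @words);          # header row, first cell empty
    for my $row (@$rows) {
        my ($class, $vector) = @$row;
        push @lines, join "\t", $class, map { $vector->{$_} // 0 } @words;
    }
    return join '', map { "$_\n" } @lines;
}

print format_matrix(
    [ [ 'CLASS', { one => 1, two => 1 } ], [ 'CLAS12', { three => 1 } ] ],
    qw(one two three),
);
```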

answered 2013-05-22T18:50:36.580