perl - How to index all unique words in a corpus using Perl

Question

.i 1
.t
 effici machineindepend procedur 
garbag collect  variou list structur
.w
 method  return regist   free
list   essenti part   list process
system.   paper past solut   recoveri
problem  review  compar.  new algorithm
 present  offer signific advantag  speed
 storag util.  routin  implement
 algorithm   written   list languag 
       insur  degre
 machin independ. final  applic  
algorithm   number  differ list structur
appear   literatur  indic.
.b
cacm august 1967
.a
schorr h.
wait w. m.
.n
ca670806 jb februari 27 1978 428 pm
.x
1024 4 1549

1024 4 1549

1050 4 1549

.i 2
.t
 comparison  batch process  instant turnaround
.w
 studi   program effort  student
  introductori program cours  present
  effect  have instant turnaround   minut
 oppos  convent batch process
 turnaround time    hour  examin. 
 item compar   number  comput
run  trip   comput center program prepar
time keypunch time debug time
number  run  elaps time    run
   run   problem.   
result  influenc   fact  bonu point
 given  complet   program problem
    specifi number  run 
 evid  support instant  batch.
.b
cacm august 1967
.a
smith l. b.
.n
ca670805 jb februari 27 1978 432 pm
.x
1550 4 1550

1550 4 1550

1304 5 1550

1472 5 1550

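The text above is the content of 2 documents, both of which have been stopped (stop words removed) and stemmed. A new document starts with .i (followed by a number). I need to index the words in the text between .t & .b, .b & .a, .a & .n, and .n & .x, and ignore all text between .x and the start of the next document, i.e. .i (followed by a number).

The contents of all the documents are stored in a single file, say "corpus". I need to index all the unique words, along with the number of times each one appears in the corpus and in each document, and possibly the positions at which it appears in the document.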

my $file = 'sometext.txt';
open FILE, '<', $file or die $!;
my @texts = <FILE>;
foreach my $text (@texts) {
    my @lines = split("\n", $text);
    foreach my $line (@lines) {
        my @words = split(" ", $line);
        foreach my $word (@words) {
            $word = trim($word);
            # \Q...\E quotes regex metacharacters such as the "." in "h."
            my $match = qr/\Q$word\E/i;

            # Rescan the whole file once for every word
            open STFILE, '<', $file or die $!;
            my $count = 0;

            while (<STFILE>) {
                if ($_ =~ $match) {
                    my @mword = split /\s+/, $_;
                    for my $i (0..$#mword) {
                        if ($mword[$i] =~ $match) {
                            #print "match found on line $. word ", $i+1,"\n";
                            $count++;
                        }
                    }
                }
            }
            print "$word appears $count times \n";
            close(STFILE) or die "Couldn't close $file: $!\n\n";
        }
    }
}

close(FILE) or die "Couldn't close $file: $!\n\n";

# Strip leading and trailing whitespace
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

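The code above counts the occurrences of each word across the whole corpus. How can I change it so that it also counts the occurrences of each word within each individual document?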

score 2 · Accepted Answer

How about:

Edit: added a separate counter for each document:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

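my $words;    # $words->{word}{document marker} = count
my $doc;      # marker line of the current document, e.g. ".i 1"
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file' for reading:$!";
while(<$fh>) {
    chomp;
    $doc = $_ if /^\.i/;               # remember which document we are in
    next if (/^\.x\b/ .. /^\.i\b/);    # skip from .x through the next .i line
    next if /^\./;                     # skip the remaining marker lines
    my @words = split;
    for(@words) {
        $words->{$_}{$doc}++;          # per-word, per-document counter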
    }
}
close $fh;
print Dumper $words;
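
The question also asks for corpus-wide totals. Those can be derived from the per-document counts rather than counted separately. The following is only a sketch: the `corpus_totals` helper and the sample counts are made up for illustration, and the input is assumed to have the same word => document => count shape as the `$words` structure above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: add up the per-document counts of every word.
# Takes a hash ref shaped like { word => { doc => count, ... }, ... }
# and returns a hash ref shaped like { word => total }.
sub corpus_totals {
    my ($words) = @_;
    my %total;
    for my $word (keys %$words) {
        $total{$word} = sum(values %{ $words->{$word} });
    }
    return \%total;
}

# Invented sample counts, not taken from the real corpus
my $words = {
    list  => { '.i 1' => 2, '.i 2' => 1 },
    batch => { '.i 2' => 1 },
};

my $total = corpus_totals($words);
for my $word (sort keys %$total) {
    print "$word appears $total->{$word} time(s) in the corpus\n";
}
```

Keeping only the per-document counts and summing on demand avoids having two counters that could drift apart.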

score 1 · Accepted Answer

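Use a hash whose values hold the current count for each word. Loop over all the lines and all the words. Use a dumb (flag-variable-based) state machine to ignore the text between .t and .b

If you have trouble writing any of the above code, post a specific question about what problem you are having.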
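
One possible reading of this approach, as a rough sketch rather than a definitive implementation: a `$skip` flag is switched on at a .x line and off again at the next .i line, which is the region the question asks to ignore. The `index_corpus` name, the shape of the returned hash, and numbering word positions from 1 within each document are all assumptions made for illustration. It reads a small in-memory sample so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: index the words read from an open filehandle.
# Returns { word => { total => N, docs => { doc_number => [positions] } } }
sub index_corpus {
    my ($fh) = @_;
    my (%index, $doc, $pos);
    my $skip = 0;                          # flag: inside a .x section?
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\.i\s+(\d+)/i) {    # a new document starts here
            ($doc, $pos, $skip) = ($1, 0, 0);
            next;
        }
        if ($line =~ /^\.x\b/i) {          # ignore everything until the next .i
            $skip = 1;
            next;
        }
        # Skip ignored text, other marker lines, and text before the first .i
        next if $skip or $line =~ /^\./ or !defined $doc;
        for my $word (split ' ', $line) {
            $pos++;
            $index{$word}{total}++;
            push @{ $index{$word}{docs}{$doc} }, $pos;
        }
    }
    return \%index;
}

# Tiny made-up sample in the same layout as the question's data
my $sample = <<'END';
.i 1
.t
list structur
.w
free list
.x
1024 4 1549
.i 2
.t
batch list
.x
1550 4 1550
END

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $index = index_corpus($fh);
close $fh;

for my $word (sort keys %$index) {
    my $entry = $index->{$word};
    print "$word: total $entry->{total}\n";
    for my $doc (sort { $a <=> $b } keys %{ $entry->{docs} }) {
        my @positions = @{ $entry->{docs}{$doc} };
        printf "  doc %s: %d time(s) at position(s) %s\n",
            $doc, scalar @positions, join(',', @positions);
    }
}
```

Storing the positions as a list per document means the per-document count is simply the length of that list, so no separate per-document counter is needed.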
