
I have a space-delimited file like this:

 First        Second        Third       Forth
 It               is        possible    to   
 do             this                    task
 with          regex        but         i
 don't          know        how         to 

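My task is to capture all the words in each line and build a hash from them.

But here is my problem: the field in any column may be empty (for example, line 3, field 3).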

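The words in each line are aligned with the column names, either at their beginning or at their end. (The column names are the words in the first line, i.e. First, Second, Third, Forth.)

In my example, the words in the columns First, Third and Forth are aligned to the left (the beginning of the column name), and the words in the column Second are aligned to the right (the end of the column name).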

Using the hash for each line, I have to create output in the following format:

$hash{First} has Second-property $hash{Second}. It also has $hash{Third} and $hash{Forth}.

use File::Basename;
use locale;
open my $file, "<", $ARGV[0];
open my $file2,">>",fileparse($ARGV[0])."2.txt";
my @alls = <$file>;

sub Main{
my $first = shift @alls;
my $poses = First_And_Last($first);
my $curr_poses;
my $curr_hash;
#do{OutputLine($_->[0],$_->[1],$first)}for (@$poses);
my $result_array=[];
my @keys = qw(# Variable Type Len Format Informat Label);
for $word(@alls){
    $curr_poses=First_And_Last($word);
    undef ($curr_hash);
    $curr_hash = Take_Words($poses, $word, $curr_poses);
    push @{$result_array},$curr_hash; #AoH  
    }

#end of main
}

sub First_And_Last{
    #First_And_Last($str)
    my $str = shift;    
    my $begin;
    my $end;
    my $ref=[];
    while ($str=~m/(([\S\.]\s?)+\b|#)/g){       
        $begin = pos($str) - length($1);
        $end = pos($str);       
        push @{$ref},[$begin,$end];
        }               
    return $ref;
    }

sub Take_Words{
    #Take_Words($poses, $line,$current) 
    my $outref = {};
    my $ref = shift; #take the ref of offsets of words
    my $line = shift;# and the next line in file
    my $current = shift; # and this is the poses of current line
    my @keys = qw(# Variable Type Len Format Informat Label);
    do{$outref->{$_}=undef;}for(@keys);
    my $ethalon; #for $ref
    my $relativity; #for $current
    my $key; #for key in $outref
    my @ethalon = @{$ref};

    $ethalon = shift @ethalon;
    $relativity = shift @{$current};
    $key = shift @keys;

    while (defined($key) && defined($relativity)){
        if ($ethalon->[0] == $relativity->[0] || $ethalon->[1] == $relativity->[1]){    
                $outref->{$key} = substr($line, $relativity->[0],$relativity->[1] - $relativity->[0]);          

                $relativity = shift @{$current};
            }
            $ethalon = shift @ethalon;
            $key = shift @keys;         
        }


    return $outref;
    }

1 Answer


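Here is my algorithm, though it is a bit C-ish:

  1. Determine the start position of each column header and store it.

  2. For each column: go to the start position of its header.

  3. Walk left until you have passed two consecutive spaces.

  4. Walk right two characters and remember the position.

  5. Walk right until you have passed two consecutive spaces.

  6. Walk left two characters and remember the position.

  7. Extract everything between the boundaries you found.

  8. Strip leading and trailing whitespace.

  9. Store the result in your hash.

  10. Repeat from step 2.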
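The per-line part (steps 2 to 9) could be sketched roughly as follows. This is a variation, not a literal transcription of the steps: walking outward from the header's start position alone finds nothing when a right-aligned value begins to the right of that position, so the sketch assumes that both the first and the last character index of each header are known, and it anchors on whichever of the two holds a non-space character in the data line. The names `field_around` and `parse_line` and the `[name, first, last]` layout are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: grow outward from $pos until two consecutive
# whitespace characters (or a line edge) are met on each side.
sub field_around {
    my ($line, $pos) = @_;
    my ($left, $right) = ($pos, $pos);
    my $last = length($line) - 1;

    # Steps 3-4: walk left; a single space may sit inside a field.
    while ($left > 0) {
        if (substr($line, $left - 1, 1) =~ /\s/) {
            last if $left < 2 || substr($line, $left - 2, 1) =~ /\s/;
        }
        $left--;
    }
    # Steps 5-6: walk right under the same rule.
    while ($right < $last) {
        if (substr($line, $right + 1, 1) =~ /\s/) {
            last if $right + 2 > $last || substr($line, $right + 2, 1) =~ /\s/;
        }
        $right++;
    }
    # Steps 7-8: extract the field and trim it.
    my $field = substr($line, $left, $right - $left + 1);
    $field =~ s/^\s+|\s+$//g;
    return $field;
}

# Hypothetical helper: $columns is a list of [name, first_index, last_index].
sub parse_line {
    my ($line, $columns) = @_;
    my %row;
    for my $col (@$columns) {
        my ($name, $first, $last) = @$col;
        # Anchor on the header start (left-aligned value) or on the
        # header end (right-aligned value), whichever is non-space.
        my ($anchor) = grep {
            $_ < length($line) && substr($line, $_, 1) =~ /\S/
        } ($first, $last);
        # Step 9: store the value, or undef for an empty field.
        $row{$name} = defined $anchor ? field_around($line, $anchor) : undef;
    }
    return \%row;
}

# Offsets (0-based) of the header words in the sample file.
my @columns = (
    [ First  =>  1,  5 ],
    [ Second => 14, 19 ],
    [ Third  => 28, 32 ],
    [ Forth  => 40, 44 ],
);

# Third data line of the sample: the Third field is empty.
my $line = ' do' . ' ' x 13 . 'this' . ' ' x 20 . 'task';
my $row  = parse_line($line, \@columns);
printf "%s=%s\n", $_, $row->{$_} // '(empty)' for qw(First Second Third Forth);
```

One limitation of this sketch: a left-aligned value long enough to reach into the next header's span would be reported for both columns, so it only suits files where values stay within their own column.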

Now let's look at the implementation:

Step 1:

my @starting;
{
  my @char = split m{}, <$file>; # split the first line into char array
  my $spacecount = 0;
  my $state = 1; # 1 : find start -- 0 : find end
  for (my $i = 0; $i < @char; $i++) {
    if ($state) { # find next non-space
      if ($char[$i] =~ /\s/) {
        next;
      } else {
        $state = not $state; # flip
        $spacecount = 0;
        push @starting, $i;
        next;
      }
    } else {
      if ($char[$i] =~ /\s/) {
        $spacecount++;
        if ($spacecount >= 2) {
          $state = not $state; # flip
          next;
        }
      } else {
        $spacecount = 0; # reset consecutive space counter
        next;
      }
    }
  }
}
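The start offsets can also be collected with a global regex match and Perl's `@-` / `@+` offset arrays, which gives the end of each header as well (handy when a column is right-aligned). The following is only a minimal sketch: `header_spans` is a hypothetical name, and allowing a single space inside a header name is an assumption that mirrors the "two consecutive spaces" rule of the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: return one [name, first_index, last_index]
# triple per header word, with 0-based character offsets.
sub header_spans {
    my ($header) = @_;
    my @spans;
    # A name is a run of non-space characters that may contain single
    # whitespace characters; two in a row end the name.
    while ($header =~ /(\S+(?:\s\S+)*)/g) {
        # $-[1] is the offset where group 1 starts, $+[1] where it ends.
        push @spans, [ $1, $-[1], $+[1] - 1 ];
    }
    return \@spans;
}

# Header line of the sample file, spelled out so the spacing is explicit.
my $header = ' First' . ' ' x 8 . 'Second' . ' ' x 8 . 'Third' . ' ' x 7 . 'Forth';
printf "%-6s %2d..%2d\n", @$_ for @{ header_spans($header) };
```

For the sample header this reports First at 1..5, Second at 14..19, Third at 28..32 and Forth at 40..44.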
answered 2012-07-16T10:35:46.310