regex - Perl: how to find the position of a capture

Question

I have a space-delimited file like this:

 First        Second        Third       Forth
 It               is        possible    to   
 do             this                    task
 with          regex        but         i
 don't          know        how         to

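My task is to capture all the words on each line and build a hash from them.

Here is my problem: a field in any column may be empty (for example, line 3, field 3).

The words on each line are aligned with the column name at either its start or its end. (The column names are the words on the first line, i.e. First Second Third Forth.)

In my example, the words are left-aligned (with the start of the column name) in the First, Third and Forth columns, and right-aligned (with the end of the column name) in the Second column.

Using the hash for each line, I have to produce output in the following format: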

$hash{First} has Second-property $hash{Second}. It also has $hash{Third} and $hash{Forth}.

use File::Basename;
use locale;
open my $file, "<", $ARGV[0];
open my $file2,">>",fileparse($ARGV[0])."2.txt";
my @alls = <$file>;

sub Main{
my $first = shift @alls;
my $poses = First_And_Last($first);
my $curr_poses;
my $curr_hash;
#do{OutputLine($_->[0],$_->[1],$first)}for (@$poses);
my $result_array=[];
my @keys = qw(# Variable Type Len Format Informat Label);
for $word(@alls){
    $curr_poses=First_And_Last($word);
    undef ($curr_hash);
    $curr_hash = Take_Words($poses, $word, $curr_poses);
    push @{$result_array},$curr_hash; #AoH  
    }

#end of main
}

sub First_And_Last{
    #First_And_Last($str)
    my $str = shift;    
    my $begin;
    my $end;
    my $ref=[];
    while ($str=~m/(([\S\.]\s?)+\b|#)/g){       
        $begin = pos($str) - length($1);
        $end = pos($str);       
        push @{$ref},[$begin,$end];
        }               
    return $ref;
    }

sub Take_Words{
    #Take_Words($poses, $line,$current) 
    my $outref = {};
    my $ref = shift; #take the ref of offsets of words
    my $line = shift;# and the next line in file
    my $current = shift; # and this is the poses of current line
    my @keys = qw(# Variable Type Len Format Informat Label);
    do{$outref->{$_}=undef;}for(@keys);
    my $ethalon; #for $ref
    my $relativity; #for $current
    my $key; #for key in $outref
    my @ethalon = @{$ref};

    $ethalon = shift @ethalon;
    $relativity = shift @{$current};
    $key = shift @keys;

    while (defined($key) && defined($relativity)){
        if ($ethalon->[0] == $relativity->[0] || $ethalon->[1] == $relativity->[1]){    
                $outref->{$key} = substr($line, $relativity->[0],$relativity->[1] - $relativity->[0]);          

                $relativity = shift @{$current};
            }
            $ethalon = shift @ethalon;
            $key = shift @keys;         
        }


    return $outref;
    }

score 2 · Accepted Answer

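Here is my algorithm, although it is a bit C-ish:

1. Determine the starting position of each column header and store it.
2. For each column: go to the starting position of the header.
3. Walk left until you have passed two consecutive spaces.
4. Walk right two characters and remember the position.
5. Walk right until you have passed two consecutive spaces.
6. Walk left two characters and remember the position.
7. Extract everything between the boundaries you found.
8. Remove leading and trailing whitespace.
9. Store it in your hash.
10. Repeat from step 2.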

Now let's look at the implementation:

Step 1:

my @starting;
{
  my @char = split m{}, <$file>; # split the first line into char array
  my $spacecount = 0;
  my $state = 1; # 1 : find start -- 0 : find end
  for (my $i = 0; $i < @char; $i++) {
    if ($state) { # find next non-space
      if ($char[$i] =~ /\s/) {
        next;
      } else {
        $state = not $state; # flip
        $spacecount = 0;
        push @starting, $i;
        next;
      }
    } else {
      if ($char[$i] =~ /\s/) {
        $spacecount++;
        if ($spacecount >= 2) {
          $state = not $state; # flip
          next;
        }
      } else {
        $spacecount = 0; # reset consecutive space counter
        next;
      }
    }
  }
}
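
For comparison, here is a minimal regex-based sketch of the whole task. It is not the walk-left/walk-right procedure described above. Instead it reads the start and end offsets of each match from Perl's built-in `@-` and `@+` arrays and assigns a word to the column whose header shares its start or its end offset. The helper names `field_spans` and `parse_table` are made up for illustration, and the sample rows are rebuilt with `sprintf` (an assumption, to keep the column alignment exact) rather than read from the file.

```perl
use strict;
use warnings;

# Return [start, end) offsets of every field in a line. A field is a run of
# non-space characters, optionally joined by single spaces; two or more
# spaces separate fields.
sub field_spans {
    my ($line) = @_;
    my @spans;
    while ($line =~ /\S+(?: \S+)*/g) {
        # $-[0] and $+[0] are the start and end offsets of the last match.
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

# Build one hash per data row, keyed by the header names.
sub parse_table {
    my ($header, @rows) = @_;

    my @cols;
    for my $span (field_spans($header)) {
        my ($s, $e) = @$span;
        push @cols, { name => substr($header, $s, $e - $s), start => $s, end => $e };
    }

    my @records;
    for my $row (@rows) {
        my %rec;
        $rec{ $_->{name} } = undef for @cols;    # empty cells stay undef
        for my $span (field_spans($row)) {
            my ($s, $e) = @$span;
            for my $col (@cols) {
                # Left-aligned words share the start offset of the header,
                # right-aligned words share its end offset.
                next unless $col->{start} == $s || $col->{end} == $e;
                $rec{ $col->{name} } = substr($row, $s, $e - $s);
                last;
            }
        }
        push @records, \%rec;
    }
    return @records;
}

# Rebuild the sample layout: First, Third and Forth are left-aligned,
# Second is right-aligned.
my $fmt    = ' %-13s%6s' . (' ' x 8) . '%-12s%s';
my $header = sprintf $fmt, qw(First Second Third Forth);
my @rows   = map { sprintf $fmt, @$_ } (
    [ 'It',    'is',    'possible', 'to'   ],
    [ 'do',    'this',  '',         'task' ],
    [ 'with',  'regex', 'but',      'i'    ],
    [ "don't", 'know',  'how',      'to'   ],
);

my @records = parse_table($header, @rows);

for my $h (@records) {
    # Print empty cells as empty strings.
    my %hash = map { $_ => $h->{$_} // '' } keys %$h;
    print "$hash{First} has Second-property $hash{Second}. ",
          "It also has $hash{Third} and $hash{Forth}.\n";
}
```

A word that matches neither the start nor the end of any header is silently dropped in this sketch. Real input with looser alignment would need a tolerance or an overlap test instead of exact equality.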
