perl - How can I extract columns from a fixed-width format in Perl?

Question

I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have working regex to grab each of those fields. Once I have the line in a variable, i.e. $line - how can I grab each of those fields and place them into their own variables even though they have different delimiters?

score 14 · Accepted Answer

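This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, if you use upper-case template letters (A10, etc.), trailing whitespace will be removed for you. Note: as brian d foy points out, the split approach does not work well when fields are missing (for example, the second line of data), because the field position information is lost; unpack is the way to go here, unless we are misunderstanding your data.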

use strict;
use warnings;

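while (my $line = <DATA>){
    chomp $line;
    # Whitespace-delimited: blank columns collapse, so field positions are lost.
    my @fields_whitespace = split /\s+/, $line;
    # Fixed columns of 10, 10, 12 and 28 characters; 'a' keeps the padding.
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}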

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
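
To make the difference concrete, here is a minimal sketch that runs both approaches on one row whose second column is blank. The row is rebuilt with sprintf so the column widths (10, 10, 12, 28, taken from the unpack template above) are exact; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Rebuild a row with a blank second column, using the 10/10/12/28 layout.
my $line = sprintf '%-10s%10s%12s%28s', '1253678400', '', '86400', '6183.000000';

# split collapses the blank column, so the later fields shift left.
my @by_split = split ' ', $line;

# unpack keeps column positions; 'A' strips trailing padding only.
my @by_unpack = unpack 'A10 A10 A12 A28', $line;

# Right-aligned fields still carry leading padding, so trim it.
s/^\s+// for @by_unpack;

print scalar(@by_split), " fields from split\n";   # 3 fields from split
print join('|', @by_unpack), "\n";                 # 1253678400||86400|6183.000000
```

With split there is no way to tell which column the three values came from, whereas unpack returns an empty string in the slot of the missing field.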

score 3 · Accepted Answer

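Use my module DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If it isn't fast enough, you can pass in an unpack_string and eliminate the need for heuristic detection of the column boundaries.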

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

Alternatively, if you want header info:

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

score 0 · Accepted Answer

I'm not sure of the column names and formatting, but you should be able to adapt this recipe to your liking with Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

score -1 · Accepted Answer

If all fields have the same fixed width and are padded with spaces, you can use the following split:

@array = split / {1,N}/, $line;

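where N is the width of the field (replace it with the actual number in the pattern). This will yield a space for each empty field.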

score -1 · Accepted Answer

You could split the line. It looks like your delimiter is just whitespace? You could do something along these lines:

@line = split(" ", $line);

This splits on any run of whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

split can also take a regular expression rather than a string as the delimiter.

@line = split(/\s+/, $line);

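This does nearly the same thing. The one difference is that split(" ", ...) also skips leading whitespace, whereas /\s+/ produces an empty first field when the line starts with whitespace.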
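
As a quick check of how the two forms of split behave, here is a small sketch on a made-up line that begins with spaces (the sample string and variable names are assumptions for illustration only):

```perl
use strict;
use warnings;

# Made-up sample line with leading whitespace.
my $line = "  1254196800  0.000000  8828.000000";

# The special pattern ' ' skips leading whitespace before splitting.
my @awk_style = split ' ', $line;

# A /\s+/ pattern treats the leading run as a separator, so the
# first element is an empty string.
my @regex_style = split /\s+/, $line;

print scalar(@awk_style), "\n";     # 3
print scalar(@regex_style), "\n";   # 4
```

Neither form can recover a column that is blank in the input, so both only suit data where every field is present.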

score -2 · Accepted Answer

Fixed-width delimiting can be done like this:

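my %header = (
    field1 => 0,    # char position of first char in field
    field2 => 12,
    field3 => 15,
);

# IN is assumed to be an already-open filehandle.
while (my $line = <IN>) {
    chomp $line;
    # substr takes a length, so field2 runs from its start up to field3's start.
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";   # value of field2
}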

My Perl is very rusty, so there may well be syntax errors in there, but that's the gist of it.
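
The same substr idea can be sketched with an explicit offset and length per column. The layout table below is hypothetical: the names are placeholders and the widths (10, 10, 12, 28) are borrowed from the unpack template in the top answer, so adjust both to your real data.

```perl
use strict;
use warnings;

# Hypothetical layout: [name, offset, length] for each column.
my @layout = (
    [ timestamp => 0,  10 ],
    [ field2    => 10, 10 ],
    [ period    => 20, 12 ],
    [ field4    => 32, 28 ],
);

# Sample row built with sprintf so the column widths are exact.
my $line = sprintf '%-10s%10s%12s%28s',
    '1253851200', '36.000000', '86400', '10669.000000';

my %row;
for my $col (@layout) {
    my ($name, $offset, $length) = @$col;
    # Guard against short lines: substr past the end would warn.
    my $value = $offset < length($line) ? substr($line, $offset, $length) : '';
    $value =~ s/^\s+|\s+$//g;    # trim padding on both sides
    $row{$name} = $value;
}

print "$_->[0]=$row{ $_->[0] }\n" for @layout;
```

Keeping the offsets in one table makes it easy to change the layout without touching the parsing loop.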
