
I have a large dataset (12,000 rows x 14 columns); the first 4 rows are as follows:

x1  y1  0.02    NAN NAN NAN NAN NAN NAN 0.004   NAN NAN NAN NAN
x2  y2  NAN 0.003   NAN 10  NAN 0.03    NAN 0.004   NAN NAN NAN NAN
x3  y3  NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN
x4  y4  NAN 0.004   NAN NAN NAN NAN 10  NAN NAN 30  NAN 0.004

I need to delete every row in which columns 3-14 are all "NAN", then output the rest of the dataset. I wrote the following code:

#!usr/bin/perl

use warnings;
use strict;
use diagnostics;

open(IN, "<", "file1.txt") or die "Can't open file for reading:$!";

open(OUT, ">", "file2.txt") or die "Can't open file for writing:$!";

my $header = <IN>;
print OUT $header;

my $at_line = 0;

my $col3;
my $col4;
my $col5;
my $col6;
my $col7;
my $col8;
my $col9;
my $col10;
my $col11;
my $col13;
my $col14;
my $col15;

while (<IN>){
chomp;
my @sections = split(/\t/);

$col3 = $sections[2];
$col4 = $sections[3];;
$col5 = $sections[4];
$col6 = $sections[5];
$col7 = $sections[6];
$col8 = $sections[7];
$col9 = $sections[8];
$col10 = $sections[9];
$col11 = $sections[10];
$col13 = $sections[11];
$col14 = $sections[12];
$col15 = $sections[13];

if ($col3 eq "NAN" && $col4 eq "NAN" && $col5 eq "NAN" && $col6 eq "NAN" && $col7 eq "NAN" && $col8 eq "NAN" && $col9 eq "NAN" && $col10 eq "NAN" 
&& $col11 eq "NAN" && $col12 eq "NAN" && $col13 eq "NAN" && $col14 eq "NAN" && $col5 eq "NAN"){
    $at_line = $.;
    }   
    else {
        print OUT "$_\n";
    }
}

close(IN);
close(OUT);

Running this code produces the following error:

Use of uninitialized value $col3 in string eq at filter.pl
    line 46, <IN> line 2 (#1)

How can I make this program work? Thanks.


3 Answers


One-liner:

$ perl -lane 'print if join("", @F[2..13]) ne "NAN" x 12' <file1.txt >file2.txt
answered 2013-09-09T10:13:13.193

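Zaid's one-liner is the best solution for your particular case. In general, rather than defining so many scalars, your pattern should be

my @required_columns = (split /\s+/)[2..13];

The error you are getting appears to come from splitting on tabs when your dataset is separated by spaces. Remember that split takes a regular expression, not a string.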
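
To see why the tab split triggers the warning, here is a minimal sketch. It assumes the fields in file1.txt are separated by runs of spaces rather than tabs; the sample line is hypothetical and only mirrors the first data row from the question.

```perl
use strict;
use warnings;

# Hypothetical sample row, assumed to be space-separated like the data above.
my $line = "x1  y1  0.02    NAN NAN NAN NAN NAN NAN 0.004   NAN NAN NAN NAN";

# The line contains no tab, so the whole line comes back as a single field
# and $by_tab[2] is undef (hence "Use of uninitialized value").
my @by_tab = split /\t/, $line;

# Splitting on runs of whitespace yields the 14 expected fields.
my @by_space = split /\s+/, $line;

printf "tab split: %d field(s)\n", scalar @by_tab;
printf "whitespace split: %d field(s)\n", scalar @by_space;
```

If the real file does turn out to be tab-separated, the cause lies elsewhere (for example, short or blank lines), so it is worth checking the field count on the line the warning reports.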

answered 2013-09-09T11:55:26.110
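while (<IN>) {
    # Columns 3-14 of the whitespace-separated line
    my @values = (split /\s+/)[2..13];
    # grep in scalar context counts the fields equal to 'NAN'
    my $nan_count = grep { $_ eq 'NAN' } @values;
    # Skip the row only when all 12 columns are 'NAN'
    print $_ unless $nan_count == 12;
}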

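Joseph R. has the right approach for splitting the lines.

grep returns the number of matches when called in scalar context, so this gives another way to check whether all the columns are equal to NAN.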
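
As a small self-contained illustration of counting with grep, the sketch below wraps the check in a helper. The name all_nan and the two sample rows are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every value in the list is the string "NAN".
sub all_nan {
    my @values = @_;
    # Assigning grep's result to a scalar gives the number of matching elements.
    my $nan_count = grep { $_ eq 'NAN' } @values;
    return $nan_count == @values;
}

# Made-up rows: one with 12 NAN fields, one with a single real value.
my @empty_row = ('NAN') x 12;
my @mixed_row = ('NAN', '0.003', ('NAN') x 10);

print "empty_row dropped\n" if all_nan(@empty_row);
print "mixed_row kept\n" unless all_nan(@mixed_row);
```

Comparing the count against the size of the list, rather than the literal 12, keeps the check valid if the number of columns changes.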

answered 2013-09-11T16:40:49.777