perl - Perl function to extract data with a specified starting column and length

Question

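I am trying to write code that pulls data from file A and pastes only the column data with a specified start and end point into file B. So far I have only managed to copy all the data from A to B, and I am getting nowhere with filtering by column. I have tried researching splice and grep, to no avail. I have no Perl experience. The data has no column headers. Example (the real data is thousands of lines long, so I cannot embed it in the script):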

AAA 565 u8y 221
ABC 454 9u8 352
ADH 115 i98 544
AKS 352 87y 454
GJS 154 i9k 141

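I would like all the unique values of column 3 (start: 8, length: 3) to be copied into file B. I have tried the solution provided in "How can I extract a specific column of data in Perl?", to no avail.

Thanks for any tips or help!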

#!/usr/bin/perl
use strict;
use warnings;

#use Cwd qw(abs_path);

#my $dir = '/home/
#$dir = abd_path($dir);
my $filename = "filea.txt";
my $newfilename = "fileb.txt"; 

#Open file to read raw data
open (DATA1, "<$filename") or die "Couldn't open $filename: $!";

#Open new file to copy desired columns
open (DATA2, ">$newfilename") or die "Couldn't open $newfilename: $!";

#Copy data from original to new file

while (<DATA1>) {
    #DATA2=splice(DATA1, 0,5);
    print DATA2 $_;
    my @fifth_column = map{(split)[1]} split /\n/, $newfilename;    
}

score 3 · Accepted Answer

If I understand you correctly, you can use a fairly simple script for this.

use strict;
use warnings;

my %seen;
while (<DATA>) {
    my $str = substr($_, 8, 3);   # the string you seek
    unless ($seen{$str}++) {      # if it is not seen before
        print "$str\n";           # ...print it
    }
}

__DATA__
AAA 565 u8y 221
AAA 565 u8y 221
ABC 454 9u8 352
ADH 115 i98 544
AKS 352 87y 454
GJS 154 i9k 141

Output:

u8y
9u8
i98
87y
i9k

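The DATA file handle is used here for demonstration. I also added a duplicate line to the data to demonstrate the deduplication. If you change <DATA> to <>, you can simply use the script like this: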

perl script.pl filea.txt > fileb.txt

Note that this relies on your data being fixed width, which means that if your fields are not aligned, your output will be corrupted.
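To make the fixed-width caveat concrete, here is a small sketch that compares substr with a whitespace split on two lines, the second of which has a first field that is one character short. The misaligned sample line and the helper names by_offset and by_field are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
# by_offset: take 3 characters starting at offset 8 (fixed-width approach).
sub by_offset { my ($line) = @_; return substr($line, 8, 3) }
# by_field: take the third whitespace-separated field.
sub by_field  { my ($line) = @_; return (split ' ', $line)[2] }

my $aligned    = "AAA 565 u8y 221";
my $misaligned = "AA 565 u8y 221";   # assumed bad input: first field is short

for my $line ($aligned, $misaligned) {
    printf "%-16s offset=[%s] field=[%s]\n",
        $line, by_offset($line), by_field($line);
}
```

For the aligned line both helpers return u8y. For the misaligned line the offset version returns "8y " (shifted by one character), while the field version still returns u8y.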

Also note that this is just the full version of a simple one-liner, which looks like this:

perl -nlwe '$x=substr($_,8,3); print $x unless $seen{$x}++' filea.txt > fileb.txt

score 1 · Accepted Answer

Take a look at the following Perl functions:

split: This lets you split a line of data into an array:

Example:

while ( my $line = <$input_fh> ) {
    my @items = split /\s+/, $line;   #Columns are separated by spaces or tabs
    my $third_column = $items[2];  #The column you want;
    # ... do something with $third_column here ...
}

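substr: This lets you take a substring at a given column position. If your columns are separated by tabs, this may not be very useful. For most non-Perl developers, this is the first approach they try. However, I recommend using split.

There is a Perl trick for making sure your data is unique: use a hash to store your information. Looking up data in a hash is fast, and the exists function can be used to quickly find out whether you have already seen a value. Combining that with split: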

use strict;
use warnings;
use autodie;

use constant {
    INPUT_FILE  => "filea.txt",
    OUTPUT_FILE => "fileb.txt",
};

open my $input_fh, "<", INPUT_FILE;
open my $output_fh, ">", OUTPUT_FILE;

my %unique_columns;
while ( my $line = <$input_fh> ) {
    my @items = split /\s+/, $line;   #Columns are separated by spaces or tabs
    my $third_column = $items[2];  #The column you want;
    if ( not exists $unique_columns{$third_column} ) {
        $unique_columns{$third_column} = 1;
        print {$output_fh} "$third_column\n";
    }
}
close $output_fh;

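The %unique_columns hash keeps track of whether you have already seen a value in the third column of your file. It does not matter what you set each key's value to. However, I recommend setting it to something other than zero or an empty string, because if you write this: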

if ( $unique_columns{$data} )

instead of

if ( exists $unique_columns{$data} )

your program will still work as long as the value of $unique_columns{$data} is not zero or an empty string; otherwise it will fail.
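The difference between the two tests can be seen in a short sketch. The hash contents below are invented for the demonstration: one key holds a true value, one key is present but holds 0, and one key is missing altogether.

```perl
use strict;
use warnings;

# Illustrative data only, not taken from the real input file.
my %unique_columns = (
    'u8y' => 1,   # true value: both tests agree
    '9u8' => 0,   # key is present, but its value is false
);

my %report;
for my $data ('u8y', '9u8', 'i98') {
    my $by_exists = exists($unique_columns{$data}) ? 'yes' : 'no';
    my $by_truth  = $unique_columns{$data}         ? 'yes' : 'no';
    $report{$data} = "exists=$by_exists truth=$by_truth";
    print "$data: $report{$data}\n";
}
```

For the key 9u8 the exists test says the value was seen, while the plain truth test says it was not, which is exactly the failure mode described above.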

score 0 · Accepted Answer

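When it comes to fixed-length data, nothing beats pack/unpack. Study this tutorial; it will make your life easier and turn this job into a piece of cake.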

http://linux.die.net/man/1/perlpacktut
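As a rough sketch of how unpack could be applied to this question (assuming the same fixed-width layout as the sample data, with the wanted field at offset 8 and length 3), the template 'x8 A3' skips eight bytes and then reads three bytes as a space-padded string. The inline sample lines stand in for the real file.

```perl
use strict;
use warnings;

my %seen;     # values already encountered
my @unique;   # unique values in first-seen order

# Sample lines stand in for filea.txt; the second one is a deliberate duplicate.
for my $line ("AAA 565 u8y 221", "AAA 565 u8y 221", "ABC 454 9u8 352") {
    # 'x8' skips 8 bytes, 'A3' reads the next 3 bytes and strips trailing spaces
    my ($field) = unpack 'x8 A3', $line;
    push @unique, $field unless $seen{$field}++;
}

print "$_\n" for @unique;
```

One thing to keep in mind with this approach: unpack dies if a line is shorter than the eight bytes that 'x8' tries to skip, so short or blank lines would need to be filtered out first.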
