linux - Read multiple files in a directory and compare them with another file

Question

I have two files.

File 1, in the directory being read, has the following format:

Read 1 A T
Read 3 T C
Read 5 G T
Read 7 A G
Read 10 A G
Read 12 C G

File 2 in directory contains

Read 5 A G
Read 6 T C
Read 7 G A
Read 8 G A
Read 20 A T

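File 2 contains

I need to read the positions in file 2 first, then print the corresponding values from the files opened in the directory, laid out horizontally. If a position has no match, it should be printed as "-". For the input above, the output should be: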

     1 2 3 4 5 6 7
Read T - C - T - G
Read - - - - G C A

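I need to do this for all the files and print each one on its own line in the format above. So there will be a single output file, with as many rows as there are input files. Can I do this easily in Perl?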

score 0 · Accepted Answer

If the files are small, you can read them into memory:

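# read the input files into memory: position => [col1, col2]
use strict;
use warnings;

# read file1
my $file1_data = {};
open(my $file1_fh, "<", "/path/file1.data") or die $!;
while (my $line = <$file1_fh>) {
  chomp($line);
  # split(' ', ...) splits on runs of whitespace and ignores leading whitespace
  my ($read, $pos, $col1, $col2) = split(' ', $line);
  next unless defined $col2;   # skip blank or malformed lines
  $file1_data->{$pos} = [$col1, $col2];
}
close($file1_fh);

# read file2
my $file2_data = {};
open(my $file2_fh, "<", "/path/file2.data") or die $!;
while (my $line = <$file2_fh>) {
  chomp($line);
  my ($read, $pos, $col1, $col2) = split(' ', $line);
  next unless defined $col2;
  $file2_data->{$pos} = [$col1, $col2];
}
close($file2_fh);

# read the positions file, assumed to hold one position per line
# (/path/positions.data is a placeholder path, like the two above)
my @positions;
open(my $posfile_fh, "<", "/path/positions.data") or die $!;
while (my $pos = <$posfile_fh>) {
  chomp($pos);
  push(@positions, $pos);
}
close($posfile_fh);

# header row: the positions
print join("\t", @positions), "\n";

# one row per file; index 1 is the second data column, which is the one
# that appears in the expected output of the question
foreach my $data ($file1_data, $file2_data) {
  my @row = map { defined $data->{$_} ? $data->{$_}->[1] : "-" } @positions;
  print join("\t", @row), "\n";
}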
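
The per-file lookup described above can also be factored into two small helpers, so that the same code serves any number of files. The following is only a minimal sketch: the names `parse_reads` and `format_row` are made up for illustration, and the sample lines from the question are held in memory rather than read from disk, so that the script runs on its own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Parse lines such as "Read 5 G T" into a hash ref: position => second data column.
# Lines that do not match the pattern (blank lines, headers) are skipped.
sub parse_reads {
    my (@lines) = @_;
    my %value_at;
    for my $line (@lines) {
        next unless $line =~ /^\s*Read\s+(\d+)\s+([ACGT])\s+([ACGT])\s*$/;
        $value_at{$1} = $3;
    }
    return \%value_at;
}

# Build one output row: the value at each requested position, or "-" if absent.
sub format_row {
    my ($value_at, @positions) = @_;
    my @cells = map { defined $value_at->{$_} ? $value_at->{$_} : '-' } @positions;
    return join(' ', 'Read', @cells);
}

# Sample data copied from the question (hypothetical in-memory stand-ins for the files).
my @positions = (1 .. 7);
my @file1 = ('Read 1 A T', 'Read 3 T C', 'Read 5 G T',
             'Read 7 A G', 'Read 10 A G', 'Read 12 C G');
my @file2 = ('Read 5 A G', 'Read 6 T C', 'Read 7 G A',
             'Read 8 G A', 'Read 20 A T');

# Header row, padded so the positions line up under the values.
print join(' ', '    ', @positions), "\n";

# One row per input "file".
for my $lines (\@file1, \@file2) {
    my $row = format_row(parse_reads(@$lines), @positions);
    print "$row\n";
}
```

With real files, each array of lines would instead come from reading a file handle, and the loop would run over the file names in the directory.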

score 0 · Accepted Answer

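As far as I can tell, you only use the second data column. Here is a simple Perl program; feel free to ask if you have any questions. I used a third input file, and any number of files can be used. I changed the format so that it includes 42 at the end.

Code: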

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

# try to open format file
my $ffn = shift @ARGV or die "you didn't provide a format file name!\n";
open my $ffh, '<', $ffn;

# read format file
my @format = <$ffh>;
close $ffh;
chomp for @format; # get rid of newlines

# prepare output
print '     ' . join(' ' => @format) . "\n";

# iterate over all .txt files in the data directory
foreach my $data_fn (<data/*.txt>) {

    # iterate over all lines of the data file
    open my $data_fh, '<', $data_fn;
    my %data = ();
    foreach my $line (<$data_fh>) {

        # parse input lines (only)
        next unless $line =~ /Read (\d+) ([ACGT]) ([ACGT])/;
        my ($pos, $first, $second) = ($1, $2, $3);

        # store data
        $data{$pos} = $second;
    }

    # print summary
    print 'Read ' . join(' ' => map {$data{$_} // '-'} @format) . "\n";
}

Output:

$ perl bio.pl format.txt
     1 2 3 4 5 6 7 42
Read T - C - T - G -
Read - - - - G C A -
Read - C - T - - - A

:)
