
I have multiple CSV files and I want to merge all of them. Some of my sample CSV files are shown below.

M1DL1_Interpro_sum.csv

IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10

M1DL2_Interpro_sum.csv

IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111

M1DL3_Interpro_sum.csv

IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3

To merge these files, I tried the following code:

@ARGV = <merge_csvfiles/*.csv>;
print @ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(@ARGV))
{
open(FILE,@ARGV[$i]) || die"Can't open ...@ARGV[$i]...\n";
$data.=join("",<FILE>);

close FILE;
print"file completed...",$i+1,"\n";
$i++;
}


@data=split("\n",$data);
@data2=@data;

print scalar(@data);

for($i=0;$i<scalar(@data);$i++) 
{
@id1=split(",",@data[$i]);
$id_1=@id1[0];
@data[$j]=~s/\n//;
if(@data[$i] ne "")
{
    print PAGE "\n@data[$i],";
    for($j=$i+1;$j<scalar(@data2);$j++)
    {
        @id2=split(",",@data2[$j]);
        $id_2=@id2[0];
        if($id_1 eq $id_2)
        {

            @data[$j]=~s/\n//;
            print PAGE "@data2[$j],";
            @data2[$j]="";
            @data[$j]="";
            print "match found at ",$i+1," and ",$j+1,"\n";
        }
    }
}


print $i+1,"\n";
}

merge_csvfiles is the folder that contains all the files.

The output of the code above is:

IPR017690,Outer membrane,821,IPR017690,Outer membrane  ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase

But I want the output in the following format:

IPR017690,Outer membrane,821,IPR017690,Outer membrane  ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0

Does anyone know how I can do this? Thanks for your help.


1 Answer


As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.

So, then:

#!/usr/bin/env perl    

use strict;
use warnings;

my @input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
  open my $infh, '<', $input_files[$i]
    or die "Failed to open $input_files[$i]: $!";
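  while (<$infh>) {
    chomp;
    next unless length;              # skip blank lines, which have no ID
    my $id = (split ',', $_, 2)[0];  # text before the first comma
    $data{$id}[$i] = $_;             # slot $i holds this file's line for the ID
  }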
  print "Input file read: $input_files[$i]\n";
}

open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
  my @merge_data;
  for my $i (0 .. $#input_files) {
    push @merge_data, $data{$id}[$i] || '0,0,0';
  }
  print $outfh join(',', @merge_data) . "\n";
}

The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.

The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
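To make the hash-of-arrays layout concrete, here is a minimal sketch that runs the same two steps on in-memory data instead of files. The sub names collect_lines and merge_row, and the two small line lists, are hypothetical stand-ins chosen for illustration; they are not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for two input files (not the real data set).
my @files = (
  [ 'IPR003594,ATPase,179',   'IPR013785,Aldolase,304' ],
  [ 'IPR013785,Aldolase,300', 'IPR011991,Winged,113' ],
);

# Build the hash of arrays: ID => [ line from file 0, line from file 1, ... ].
sub collect_lines {
  my @sources = @_;
  my %data;
  for my $i (0 .. $#sources) {
    for my $line (@{ $sources[$i] }) {
      my $id = (split ',', $line, 2)[0];
      $data{$id}[$i] = $line;   # slots for the other files stay undefined
    }
  }
  return \%data;
}

# Turn one ID's slots into an output row, filling gaps with a placeholder.
sub merge_row {
  my ($slots, $file_count) = @_;
  my @parts;
  for my $i (0 .. $file_count - 1) {
    push @parts, defined $slots->[$i] ? $slots->[$i] : '0,0,0';
  }
  return join ',', @parts;
}

my $data = collect_lines(@files);
print merge_row($data->{$_}, scalar @files), "\n" for sort keys %$data;
# Prints:
# IPR003594,ATPase,179,0,0,0
# 0,0,0,IPR011991,Winged,113
# IPR013785,Aldolase,304,IPR013785,Aldolase,300
```

An ID that appears only in the second list ends up with an undefined first slot, which is what lets the placeholder land in the correct column.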

The results, in outfile.csv, are:

IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
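Note that these rows come out sorted by ID, whereas the desired output in the question lists IDs in the order they are first encountered. If that ordering matters, one possible variation is to record each ID the first time it is seen. The sketch below assumes in-memory line lists rather than files, and the sub name merge_in_first_seen_order is hypothetical.

```perl
use strict;
use warnings;

# Merge rows while remembering the order in which each ID was first seen.
# Takes a list of array refs (one per hypothetical input file) and returns
# the merged output lines.
sub merge_in_first_seen_order {
  my @sources = @_;
  my (%data, @order);
  for my $i (0 .. $#sources) {
    for my $line (@{ $sources[$i] }) {
      my $id = (split ',', $line, 2)[0];
      push @order, $id unless exists $data{$id};  # record first appearance only
      $data{$id}[$i] = $line;
    }
  }
  my @rows;
  for my $id (@order) {
    my @parts;
    for my $i (0 .. $#sources) {
      push @parts, defined $data{$id}[$i] ? $data{$id}[$i] : '0,0,0';
    }
    push @rows, join(',', @parts);
  }
  return @rows;
}

print "$_\n" for merge_in_first_seen_order(
  [ 'IPR014729,Rossmann,327', 'IPR003594,ATPase,179' ],
  [ 'IPR011991,Winged,113',   'IPR014729,Rossmann,261' ],
);
# Prints:
# IPR014729,Rossmann,327,IPR014729,Rossmann,261
# IPR003594,ATPase,179,0,0,0
# 0,0,0,IPR011991,Winged,113
```

The extra @order array is needed because Perl hashes do not preserve insertion order.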

Edit: Added explanations requested by OP in comments

Can you explain how my $id = (split ',', $_, 2)[0]; and $# work in this program?

my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:

  • Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
  • split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
  • Putting (...)[0] around the split command takes a list slice of the returned fields and selects the first element. It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
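The effect of the limit argument and the list slice can be seen in a short standalone sketch. The sub name first_field is hypothetical; the sample line is taken from the question's first file.

```perl
use strict;
use warnings;

# Return the text before the first comma, using a list slice on split.
sub first_field {
  my ($line) = @_;
  return (split ',', $line, 2)[0];
}

my $line = 'IPR017690,Outer membrane, omp85 target,821';

my @limited   = split ',', $line, 2;  # at most 2 fields; the rest stays unsplit
my @unlimited = split ',', $line;     # every comma separates a field

print scalar(@limited), " fields with limit, second is: $limited[1]\n";
print scalar(@unlimited), " fields without limit\n";
print "ID: ", first_field($line), "\n";
# Prints:
# 2 fields with limit, second is: Outer membrane, omp85 target,821
# 4 fields without limit
# ID: IPR017690
```

This line splits into four fields rather than three because its description itself contains a comma.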

$#array returns the highest-numbered index in the array @array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
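As a small illustration of $#array, the sketch below pairs each element with its index. The sub name indexed and the file names are hypothetical.

```perl
use strict;
use warnings;

# Return "index:value" strings for a list, looping over 0 .. $#items.
sub indexed {
  my @items = @_;
  my @out;
  for my $i (0 .. $#items) {
    push @out, $i . ':' . $items[$i];
  }
  return @out;
}

my @input_files = ('a.csv', 'b.csv', 'c.csv');  # hypothetical file names
print "last index: ", $#input_files, "\n";       # prints 2
print "count: ", scalar(@input_files), "\n";     # prints 3
print join(' ', indexed(@input_files)), "\n";    # prints 0:a.csv 1:b.csv 2:c.csv
```

For an empty array $#array is -1, so the range 0 .. -1 is empty and the loop body never runs.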

Answered 2013-04-01T12:22:23.797