perl - A Perl script to process a CSV file, aggregating properties spread over multiple records

Question

Sorry for the vague question, I'm struggling to think how to better word it!

I have a CSV file that looks a little like this, only a lot bigger:

The values in the first column are a ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:

550672,1;2;3;4
656372,1;2
766153,1;4

etc.

I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!

score 4 · Accepted Answer

I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience

It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
  $csv->parse($line) or die "Invalid data line";
  my ($key, $val) = $csv->fields;
  push @{ $data{$key} }, $val
}

for my $id (sort keys %data) {
  printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}

output

550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3

score 3 · Accepted Answer

Firstly props for seeking an approach not a solution. As you've probably already found with perl, There Is More Than One Way To Do It.

The approach I would take would be;

use strict;  # will save you big time in the long run

my %ids      # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
  split line into ID and property variable  # google the split function
  append new property to existing properties for this id in the hash table  # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
  deduplicate properties
  print/display/do whatever you need to do with the result
}

This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem. A more sophisticated approach would be to use a hashtable of hashtables to do the de duplication in the intial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.

Check out this question for a discussion on how to do the deduplication.

score 2 · Accepted Answer

Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.

and here you go

dtpwmbp:~ pwadas$ cat input.txt 
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl 
#!/opt/local/bin/perl

my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}

foreach my $key (sort keys %hash)
{
     print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl 
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$

score 2 · Accepted Answer

perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'

score 1 · Accepted Answer

Another (not perl) way which incidentally is shorter and more elegant:

#!/opt/local/bin/gawk -f

BEGIN {FS=OFS=",";}

NF > 0 { IDs[$1]=IDs[$1] ";" $2; }

END { for (i in IDs) print i, substr(IDs[i], 2); }

The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks of we have more than zero fields and if you do it makes the ID ($1) number the key and $2 the value. You do this for all lines.

The END statement will print these pairs out in an unspecified order. If you want to sort them you have to option of asorti gnu awk function or connecting the output of this snippet with a pipe to sort -t, -k1n,1n.

perl - A Perl script to process a CSV file, aggregating properties spread over multiple records

5 回答 5

Related

Reference