csv - 比较 2 个 CSV 巨大的 CSV 文件并使用 perl 将差异打印到另一个 csv 文件

Question

我有 2 个包含多个字段（约 30 个字段）和巨大尺寸（约 4GB）的 csv 文件。

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

我想比较这两个文件并将差异报告到另一个 csv 文件中。在上面的示例中，Employee Vivek 和 Karthick 详细信息出现在不同的行号中，但记录数据仍然相同，因此应视为匹配。员工 Vinoth 记录应被视为不匹配，因为地址不匹配。

输出 diff.csv 文件可以包含来自 File1 和 File 2 的不匹配记录，如下所示。

Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"

到目前为止，我已经编写了如下代码。在此之后，我很困惑选择二进制搜索或任何其他有效方式来选择哪个选项。请你帮助我好吗？

My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {

    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.

        }
    }

print "End of Program" . "\n";

score 2 · Accepted Answer

将如此庞大的数据存储在数据库中是此类任务的最正确方法。至少建议使用SQLlite，但其他数据库MariaDB、MySQL、PostgreSQL也可以很好地工作。

下面的代码演示了如何在没有特殊模块的情况下实现所需的输出，但它没有考虑可能弄乱的输入数据。即使差异可能只是一个额外的空间，此脚本也会将数据记录报告为不同的。

除非您指定选项output，否则默认输出将进入控制台窗口。

注意：整个文件 #1 被读入内存，请耐心等待处理大文件可能需要一段时间。

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;

my %opt;
my @args = (
            'file1|f1=s',
            'file2|f2=s',
            'output|o=s',
            'debug|d',
            'help|?',
            'man|m'
        );

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1},$opt{file2});

sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
}

sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't to open $opt{output}";

        say $fh $msg;

        close $fh;
    } else {
        say $msg;
    }
}

sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;

    }

    close $fh;

    return \%hash;
}

__END__

=head1 NAME

comp_cvs - compares two CVS files and stores differense 

=head1 SYNOPSIS

 comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

 Options:
    -f1,--file1 input CVS filename #1
    -f2,--file2 input CVS filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CVS filename #1

=item B<-f2,--file2>

Input CVS filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.

=head1 EXIT STATUS

The section describes B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

The section describes B<ENVIRONMENT VARIABLES> utilized in the program

=head1 FILES

The section describes B<FILES> which used for program's configuration

=head1 EXAMPLES

The section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

The section provides information how to report bugs

=head1 AUTHOR

The section describing author and his contanct information

=head1 ACKNOWLEDGMENT

The section to give credits people in some way related to the code

=head1 SEE ALSO

The section describing related information - reference to other programs, blogs, website, ...

=head1 HISTORY

The section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut

测试文件的输出

file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"

score 1 · Accepted Answer

#!/usr/bin/env perl

use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;

my %POS;
my %chars;

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";
<$FILEA>,
    $POS{FILEA}=tell $FILEA;
<$FILEB>,
    $POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

{ # Scan for first character of the records involved
    while (<$FILEA>) {
        $chars{substr($_,0,1)}++;
        };
    while (<$FILEB>) {
        $chars{substr($_,0,1)}--;
        };
    # So what characters do we need to deal with?
    warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
    };
my @chars=sort keys %chars;
{
    my %_h;
    # For each of the characters in our character set
    for my $char (@chars) {
        warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
        # Beginning of data sections
        seek $FILEA,$POS{FILEA},0;
        seek $FILEB,$POS{FILEB},0;
        %_h=();
        my $pos=tell $FILEA;
        while (<$FILEA>) {
            next
                unless (substr($_,0,1) eq $char);
            # for each record save the lengthAndMD5 as the key and its start as the value
            $_h{lengthAndMD5(\$_)}=$pos;
            $pos=tell $FILEA;
            };
        my $_s;
        while (<$FILEB>) {
            next
                unless (substr($_,0,1) eq $char);
            if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
                print {$InBoth} $_;
                delete $_h{$_s};
                }
            else { # (Not in FILEA) It's only in FILEB
                print {$OnlyInB} $_;
                }
            };
        # only in FILEA
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        for my $key (keys %_h) { # Only in FILEA
            seek $FILEA,delete $_h{$key},0;
            print {$OnlyInA} scalar <$FILEA>;
            };
        # Should be empty
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        };
    };

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";
exit;

    sub lengthAndMD5 {
        return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
        };

__END__

csv - 比较 2 个 CSV 巨大的 CSV 文件并使用 perl 将差异打印到另一个 csv 文件

2 回答 2

Related

Reference