perl - Perl: Compare two CSV files and print out the differences

Question

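I'm a Perl newbie and I'm having a hard time getting this to work. I have two single-column CSV files, and I'm trying to print the differences to a third file.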

File1:
123
124
125
126

File2:
123
124
127

Expected Output:
125
126
127

This is what I have so far, but it isn't working:

#!/usr/bin/perl

use strict;
use warnings;

my $sheet_1;
my $sheet_2;
my $count1 = 0;
my $count2 = 0;

my $file1 = 'file1.csv';
my $file2 = 'file2.csv';
my $file_out = 'output.csv';

open (FILE1, "<$file1")  or die "Couldn't open input file: $!"; 
open (FILE2, "<$file2")  or die "Couldn't open input file: $!"; 


while( <FILE1> ) {
  chomp;
  $count1++;
  #skip header;
  next unless $count1;
  my $row_1;
  @$row_1 = split( /,/, $_ );
  push @$sheet_1, $row_1;
}
@$sheet_1 = sort { $a->[0] <=> $b->[0] } @$sheet_1;

while( <FILE2> ) {
  chomp;
  $count2++;
  #skip header;
  next unless $count2;
  my $row_2;
  @$row_2 = split( /,/, $_ );
  push @$sheet_2, $row_2;
}

@$sheet_2 = sort { $a->[0] <=> $b->[0] } @$sheet_2;


OUTER: {
     foreach my $row_1 ( @$sheet_1 ) {
         foreach my $row_2 ( @$sheet_2 ) {
        if (@$row_1[0] eq @$row_2[0]){
        last OUTER
        }
        else{
        print "@$row_1[0]\n";
        }
        }
    }
}

close FILE1;
close FILE2;

score 2 · Accepted Answer

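Take a look at diff and comm. They may do what you want.

Now, a few questions:

If these files have only a single value per line, what makes them CSV files? CSV files have multiple columns separated by commas (CSV = comma-separated values). Is there something else going on?
If the two files have the same value, but in two different places, do you count that as a difference? Imagine a file with three lines containing 1, 2, 3. You're comparing it with a second file containing 1, 3, 2. Do the second and third lines differ? Or are the files the same because they contain the same values?

No, if the two files have the same value in different places, that value should not appear in the output. In your example, the two files (1,2,3) and (1,3,2) are the same. – Yoboy 7 hours ago

Good...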

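Whenever you have an "is an item from group #1 in group #2?" type of question, you should think of a hash.

A hash is a list of values where each value has a key. The list can contain duplicate values, but only a single instance of any particular key. This means you can easily check whether a key already exists in the list.

Imagine taking File #1 and putting each value into a hash as a key. It doesn't matter what the stored value is; you're only interested in the keys.

Now, as you go through File #2, you can quickly see whether that key is already in your hash. If it is, it's a duplicate value.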
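
The lookup just described can be sketched with in-memory lists standing in for the two files. This is only an illustration: the values are the sample data from the question, and the names %seen and @in_both are made up for the example.

```perl
use strict;
use warnings;

# Sample values from the question, held in memory for illustration
my @file1 = (123, 124, 125, 126);
my @file2 = (123, 124, 127);

# Put every File #1 value into the hash as a key; the stored value (1) is irrelevant
my %seen = map { $_ => 1 } @file1;

# For each File #2 value, a single hash lookup says whether File #1 had it too
my @in_both = grep { exists $seen{$_} } @file2;

print "in both files: @in_both\n";   # prints "in both files: 123 124"
```

Each exists check is a single hash lookup, so File #1 is never rescanned for each File #2 value, unlike the nested loops in the question.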

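We can also take advantage of a second property of hashes: only a single instance of a key is allowed. What if we throw both files into one hash? If a value is duplicated between File #1 and File #2, it doesn't matter; there can be only one instance of that key.

Here's a way to get a list of the distinct values across both files: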

use strict;
use warnings;
use feature qw(say);
use autodie;

use constant {
    FILE_1  => "file1.txt",
    FILE_2  => "file2.txt",
};

my %hash;
#
# Load the Hash with value from File #1
#
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
    chomp $value;
    $hash{$value} = 1;
}
close $file1_fh;
#
# Add File #2 to the Hash
#
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
    chomp $value;
    $hash{$value} = 1;   #If that value was in "File #1", it will be "replaced"
}
close $file2_fh;

#
# Now print out everything
#
for my $value ( sort keys %hash ) {
    say $value;
}

With the sample data from the question, this will print out:

123
124
125
126
127

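What you want, though, is a list of the values that appear in only one of the files. This is a bit more complex than it first seems. You could put the values of File #1 into a hash, then print out the values in File #2 if they're not in File #1. That gives you the values unique to File #2, but not the values unique to File #1.

So you need to create two hashes, one for File #1 and one for File #2, then go through each hash and compare it against the other: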

use strict;
use warnings;
use feature qw(say);
use autodie;

use constant {
    FILE_1  => "file1.txt",
    FILE_2  => "file2.txt",
};

#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
    chomp $value;
    $hash1{$value} = 1;
}
close $file1_fh;

#
# Load Hash #2 with value from File #2
#
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
    chomp $value;
    $hash2{$value} = 1;
}
close $file2_fh;

Now we need to compare one against the other. This time I'll store the values in an array:

my @array;
#
# Check if File #1 has unique values vs File #2
#
for my $value ( keys %hash1 ) {
   if ( not exists $hash2{$value} ) {
      push @array, $value;  #Value in File #1, but not in File #2
   }
}
#
# Check if File #2 has unique values vs File #1
#
for my $value ( keys %hash2 ) {
   if ( not exists $hash1{$value} ) {
      push @array, $value;  #Value in File #2, but not in File #1
   }
}
#
# Now print out what's in @array of unique values
#
for my $value ( sort @array ) {
    say $value;
}
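
A more compact variant of the same idea, offered as a sketch rather than as part of the answer above: count how many of the two inputs each value appears in, then keep the values with a count of exactly 1. The sub name symmetric_difference is hypothetical, and the numeric sort assumes the values are numbers, as in the question's sample data.

```perl
use strict;
use warnings;

# Return the values that appear in exactly one of the two lists.
# Takes two array references; duplicates within a single list count once.
sub symmetric_difference {
    my ($list1, $list2) = @_;
    my %in_lists;                            # value => number of lists containing it
    for my $list ($list1, $list2) {
        my %once = map { $_ => 1 } @$list;   # collapse duplicates within one list
        $in_lists{$_}++ for keys %once;
    }
    my @result = sort { $a <=> $b } grep { $in_lists{$_} == 1 } keys %in_lists;
    return @result;
}

my @diff = symmetric_difference([123, 124, 125, 126], [123, 124, 127]);
print "$_\n" for @diff;   # prints 125, 126 and 127, one per line
```

To send the result to a third file, as the question asks, open a filehandle for writing (for example on the question's output.csv) and print each value to that handle instead of to standard output.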

score 2 · Accepted Answer

You can use the Text::Diff Perl module to do this. Otherwise, see below:

Here is one algorithm for doing the comparison.

use strict;
my @arr1;
my @arr2;
my $a;

open(FIL,"a.txt") or die("$!");
while (<FIL>)
    {chomp; $a=$_; $a =~ s/[\t;, ]*//g; push @arr1, $a if ($a ne  '');};
close(FIL);

open(FIL,"b.txt") or die("$!");
while (<FIL>)
    {chomp; $a=$_; $a =~ s/[\t;, ]*//g; push @arr2, $a if ($a ne  '');};
close(FIL);

my %arr1hash;
my %arr2hash;
my @diffarr;
foreach(@arr1) {$arr1hash{$_} = 1; }
foreach(@arr2) {$arr2hash{$_} = 1; }

foreach $a(@arr1)
{
    if (not defined($arr2hash{$a})) 
     {
        push @diffarr, $a;
     }
}

foreach $a(@arr2)
{
   if (not defined($arr1hash{$a})) 
   { 
       push @diffarr, $a;
   }
}

print "Diff:\n";
foreach $a(@diffarr)
{
    print "$a\n";
}
# To print to a file instead, first reopen FIL for writing (it was closed above), then use: print FIL "$a\n";

score 0 · Accepted Answer

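If it's a single column, there are no commas to split on. Why are you doing that? Just split the file on "\n".
Don't reinvent the wheel. If it's an actual CSV with multiple columns, use something like Text::CSV::Slurp to read it.
Rather than looping through the entirety of each file every time you look up an item, use a hash as the lookup. However, if you're working with big files, you may run into memory problems.

I.e.: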

use strict;
use warnings;
use 5.012;

use Text::CSV::Slurp;

my $file1_src=<<EOF;
id,field1,field2,field3
123,junk,"quoted junk",junk 
124,"quoted junk","quoted junk",junk 
125,junk,"quoted junk",junk 
126,junk,"quoted junk",junk 
EOF

my $file2_src=<<EOF;
id,field1,field2,field3
123,junk,"quoted junk",junk 
124,junk,"quoted junk",junk 
127,"quoted junk","quoted junk",junk
EOF

my %data1 = map { $_->{id} => 1 } @{Text::CSV::Slurp->load(string => $file1_src)};
my %data2 = map { $_->{id} => 1 } @{Text::CSV::Slurp->load(string => $file2_src)};

for my $id (keys %data1, keys %data2) {
  say $id unless $data1{$id} and $data2{$id};
}
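
On the memory concern mentioned above, one possible mitigation for the single-column case is to load only the first file into a hash and stream the second file line by line. The sketch below is an assumption-laden illustration, not part of the answer: the sub name unique_to_either is hypothetical, and in-memory filehandles stand in for the real files so the example is self-contained.

```perl
use strict;
use warnings;

# Values found in only one of two inputs, holding just the first input
# (plus the values unique to the second) in memory.
# Takes two open filehandles with one value per line.
sub unique_to_either {
    my ($fh1, $fh2) = @_;

    # value => 1 while seen only in the first input, 0 once matched in the second
    my %first;
    while (my $value = <$fh1>) {
        chomp $value;
        $first{$value} = 1 if length $value;
    }

    my %second_only;
    while (my $value = <$fh2>) {
        chomp $value;
        next unless length $value;           # skip blank lines
        if (exists $first{$value}) {
            $first{$value} = 0;              # present in both inputs
        }
        else {
            $second_only{$value} = 1;        # never seen in the first input
        }
    }

    my @first_only = grep { $first{$_} } keys %first;
    return (@first_only, keys %second_only);
}

# In-memory filehandles stand in for the real files
my $data1 = "123\n124\n125\n126\n";
my $data2 = "123\n124\n127\n";
open my $fh1, '<', \$data1 or die "Cannot open in-memory file: $!";
open my $fh2, '<', \$data2 or die "Cannot open in-memory file: $!";

my @unique = sort { $a <=> $b } unique_to_either($fh1, $fh2);
print "$_\n" for @unique;   # prints 125, 126 and 127, one per line
```

Matched values are marked rather than deleted so that a value repeated in the second input is not mistaken for one that is missing from the first.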
