perl - Searching for specific duplicate IDs

Question

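I wrote a Perl script that reads two different files, compares the IDs in both, and prints only the data whose ID matches. The ID file is read into an array, while the data file is read line by line. This all works fine, but now I need to add more. My data file sometimes contains rows with a repeated ID, because the subject visited several times to provide samples. So I need to find these duplicates and record only the most recent visit date.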

So my data file looks like this:

   ID  DOV  Data1  Data2 etc etc

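I have seen that hashes are the way to search for duplicates, but every fix I have found simply removes the duplicates indiscriminately, which is not what I want.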

Any ideas?

score 0 · Accepted Answer

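This will show the last DOV for each ID. It makes a lot of assumptions about the input data, so there is a good chance it will not work for you out of the box. In particular, if your input data is not sorted by date, it will not work at all, because it simply keeps the last date seen for each ID. Also, if the date format contains spaces, e.g. "Mon Jul 9 15:51:22 CEST 2012", it will only capture the date up to the first space ("Mon" in this example). The point here is just to demonstrate the basic technique, not to provide a complete solution.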

#!/usr/bin/env perl    

use strict;
use warnings;

my %visit;
while (<DATA>) {
  my ($id, $date) = split;
  $visit{$id} = $date;
} 

for my $id (sort keys %visit) {
  print "$id => $visit{$id}\n";
} 

__DATA__
1       2012-01-01
2       2012-01-02
1       2012-02-03
3       2012-02-04
2       2012-03-05
3       2012-03-06
4       2012-04-07
1       2012-04-08
5       2012-05-09
1       2012-05-10
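
If the rows are not sorted by date, the same hash technique can be adapted to compare before overwriting. The following is only a minimal sketch: the sub name `latest_visits` and the sample rows are made up for illustration, and it assumes the date column is in YYYY-MM-DD form, so that plain string comparison (`gt`) matches chronological order.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Return a hash ref mapping each ID to its latest date.
# Assumption: whitespace-separated lines, ID in column 1 and a
# YYYY-MM-DD date in column 2, so string order equals date order.
sub latest_visits {
    my @lines = @_;
    my %latest;
    for my $line (@lines) {
        my ($id, $date) = split ' ', $line;
        next unless defined $date;    # skip blank or malformed lines
        # keep the date only if it is the first one seen or a later one
        $latest{$id} = $date
            if !exists $latest{$id} or $date gt $latest{$id};
    }
    return \%latest;
}

# Hypothetical sample rows, deliberately out of date order.
my @rows = (
    "1 2012-05-10",
    "2 2012-03-05",
    "1 2012-01-01",
    "2 2012-01-02",
);

my $visits = latest_visits(@rows);
print "$_ => $visits->{$_}\n" for sort keys %$visits;
```

With these sample rows, ID 1 keeps 2012-05-10 and ID 2 keeps 2012-03-05, even though later lines carry older dates.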

score 0 · Accepted Answer

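# Assumes IDFILE and DATAFILE are filehandles that have already been opened.

# read id file: each ID starts out mapped to the placeholder value 1
my %id_hash;
while (<IDFILE>) {
  chomp;
  $id_hash{$_} = 1;
}

# read data file
while (<DATAFILE>) {
  my @arr = split ' ', $_;   # split ' ' also ignores leading whitespace
  next unless @arr >= 2;     # skip blank or malformed lines
  if (defined $id_hash{$arr[0]}) { # only process if exists in id file
    # and only if this is the first data entry or a later visit
    # NOTE: < compares numerically, so the DOV column must be numeric
    # (e.g. YYYYMMDD); for YYYY-MM-DD strings use lt instead
    if ( (not ref $id_hash{$arr[0]}) or ($id_hash{$arr[0]}[1] < $arr[1]) ) {
      # store all data in an array ref
      $id_hash{$arr[0]} = [ @arr ];
    }
  }
}

for my $id (keys %id_hash) {
  next unless ref $id_hash{$id};   # ID had no rows in the data file
  print join(" ", @{$id_hash{$id}}), "\n";
}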
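
The numeric comparison above only works if the DOV column is numeric. If it is not, one option is to convert each date to epoch seconds first, for example with the core `Time::Piece` module. This is a rough sketch under assumptions: the helper name `dov_to_epoch`, the `%d/%m/%Y` format, and the sample dates are hypothetical, since the question does not say what the DOV column actually looks like.

```perl
use strict;
use warnings;
use Time::Piece;

# Convert a date string to epoch seconds so that visits can be
# compared numerically. strptime dies if the string does not match.
sub dov_to_epoch {
    my ($dov, $format) = @_;
    return Time::Piece->strptime($dov, $format)->epoch;
}

# Hypothetical format and dates: day/month/year.
my $fmt    = '%d/%m/%Y';
my @visits = ('09/07/2012', '01/12/2011', '15/03/2012');

# Sort newest first and take the first element.
my ($most_recent) =
    sort { dov_to_epoch($b, $fmt) <=> dov_to_epoch($a, $fmt) } @visits;

print "latest visit: $most_recent\n";
```

In the loop above, the epoch value could be stored alongside the row so that each date is parsed only once.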
