bash - 如何在不排序的情况下删除两个文件之间的公共线？

Question

我有两个未分类的文件，它们有一些共同点。

文件1.txt

Z
B
A
H
L

文件2.txt

S
L
W
Q
A

我用来删除公共行的方式如下：

sort -u file1.txt > file1_sorted.txt
sort -u file2.txt > file2_sorted.txt

comm -23 file1_sorted.txt file2_sorted.txt > file_final.txt

输出：

B
H
Z

问题是我想保持file1.txt的顺序，我的意思是：

期望的输出：

Z
B
H

我认为的一个解决方案是循环读取 file2.txt 的所有行，并且：

sed -i '/^${line_file2}$/d' file1.txt

但是如果文件很大，性能可能会很糟糕。

你喜欢我的想法吗？
你有其他选择吗？

score 26 · Accepted Answer

您可以只使用 grep （-v用于反转，-f用于文件）。从中的Grep 行input1与以下中的任何行都不匹配input2：

grep -vf input2 input1

给出：

Z
B
H

score 25 · Accepted Answer

25

grep 或 awk：

awk 'NR==FNR{a[$0]=1;next}!a[$0]' file2 file1

于 2014-06-20T09:49:04.767 回答

score 4 · Accepted Answer

我已经编写了一个小 Perl 脚本，用于这种事情。它可以做的不仅仅是你要求的，但它也可以做你需要的：

#!/usr/bin/env perl -w
use strict;
use Getopt::Std;
my %opts;
getopts('hvfcmdk:', \%opts);
my $missing=$opts{m}||undef;
my $column=$opts{k}||undef;
my $common=$opts{c}||undef;
my $verbose=$opts{v}||undef;
my $fast=$opts{f}||undef;
my $dupes=$opts{d}||undef;
$missing=1 unless $common || $dupes;;
&usage() unless $ARGV[1];
&usage() if $opts{h};
my (%found,%k,%fields);
if ($column) {
    die("The -k option only works in fast (-f) mode\n") unless $fast;
    $column--; ## So I don't need to count from 0
}

open(my $F1,"$ARGV[0]")||die("Cannot open $ARGV[0]: $!\n");
while(<$F1>){
    chomp;
    if ($fast){ 
    my @aa=split(/\s+/,$_);
    $k{$aa[0]}++;   
        $found{$aa[0]}++;
    }
    else {
    $k{$_}++;   
        $found{$_}++;
    }
}
close($F1);
my $n=0;
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");
my $size=0;
if($verbose){
    while(<F2>){
        $size++;
    }
}
close(F2);
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");

while(<F2>){
    next if /^\s+$/;
    $n++;
    chomp;
    print STDERR "." if $verbose && $n % 10==0;
    print STDERR "[$n of $size lines]\n" if $verbose && $n % 800==0;
    if($fast){
        my @aa=split(/\s+/,$_);
        $k{$aa[0]}++ if defined($k{$aa[0]});
        $fields{$aa[0]}=\@aa if $column;
    }
    else{
        my @keys=keys(%k);
        foreach my $key(keys(%found)){
            if (/\Q$key/){
            $k{$key}++ ;
            $found{$key}=undef unless $dupes;
            }
        }
    }
}
close(F2);
print STDERR "[$n of $size lines]\n" if $verbose;

if ($column) {
    $missing && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>2}keys(%k);
}
else {
    $missing && do map{print "$_\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{print "$_\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{print "$_\n" if $k{$_}>2}keys(%k);
}
sub usage{
    print STDERR <<EndOfHelp;

  USAGE: compare_lists.pl FILE1 FILE2

      This script will compare FILE1 and FILE2, searching for the 
      contents of FILE1 in FILE2 (and NOT vice versa). FILE one must 
      be one search pattern per line, the search pattern need only be 
      contained within one of the lines of FILE2.

    OPTIONS: 
      -c : Print patterns COMMON to both files
      -f : Search only the first characters of each line of FILE2
      for the search pattern given in FILE1
      -d : Print duplicate entries     
      -m : Print patterns MISSING in FILE2 (default)
      -h : Print this help and exit
EndOfHelp
      exit(0);
}

在你的情况下，你会运行它

list_compare.pl -cf file1.txt file2.txt

该-f选项使其仅比较 file2 的第一个单词（由空格定义）并大大加快速度。要比较整行，请删除-f.

bash - 如何在不排序的情况下删除两个文件之间的公共线？

3 回答 3

Related

Reference