perl - How do I remove lines from a list that can be found inside other, longer lines in the list?

Question

I have a file list.txt, like this:

cat
bear
tree
catfish
fish
bear

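I need to remove any line that is already fully found elsewhere in the document, either as a duplicate line or inside another, longer line. For example, the two "bear" lines are identical, so one of them is removed; "cat" is found entirely within "catfish", so "cat" is removed. The output would look like this: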

catfish
tree
bear

How can I remove all duplicate lines, including lines that are found inside longer lines in the list?

So far I have this:

#!/bin/bash
touch list.tmp
while read -r line
do
    found="$(grep -c $line list.tmp)"
    if [ "$found" -eq "1" ]
    then
        echo $line >> list.tmp
        echo $line" added"
    else
        echo "Not added."
fi
done < list.txt

score 3 · Accepted Answer

If O(N^2) doesn't bother you:

#!/usr/bin/env perl

use strict;
use warnings;
use List::MoreUtils qw{any};

my @words;
for my $word (
    sort {length $b <=> length $a}
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
)
{
    push @words, $word unless do {
        my $re = qr/\Q$word/;
        any {m/$re/} @words;
    };
}

print "$_\n" for @words;

If you want O(N log N), you have to use some kind of trie approach, for example a suffix tree:

#!/usr/bin/env perl

use strict;
use warnings;
use Tree::Suffix;

my $tree = Tree::Suffix->new();

my @words;
for my $word (
    sort {length $b <=> length $a}
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
)
{
    unless ($tree->find($word)){
        push @words, $word;
        $tree->insert($word);
    };
}

print "$_\n" for @words;

score 2 · Accepted Answer

This takes two passes over the file, but it should work:

Contents of script.awk

NR==FNR {
  words[$1]++
  next
} 
{
  for (word in words) { 
    if (index ($1,word) == 0) { 
      words[word] 
    } 
    else { 
      delete words[word]
      words[$1] 
    } 
  }
}
END {
  for (left in words)
    print left
}

Test:

$ cat file
cat
bear
tree
catfish
fish
bear
$ awk -f script.awk file file
bear
catfish
tree

score 2 · Accepted Answer

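I can think of a reasonably good algorithm. I'll answer in Perl to keep the result reasonably efficient.

For each word, test whether it is a substring of any word in the set of larger words. If it is not, remove from the set all words that are substrings of this word, and add the word to the set.

Because this generally means iterating over all values, we may as well use an array. To speed things up, we keep the array sorted by decreasing length. This lets us test each word already in the set exactly once.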

use strict; use warnings;

my @words;
INPUT:
while (<>) {
  chomp;
  my $len = length;
  my $i = 0;

  # check larger words if they contain $_
  LARGER:
  for ( ; $i < @words ; $i++) {
    last LARGER if length $words[$i] < $len;
    next INPUT if 0 <= index $words[$i], $_; # the word was seen
  }

  # insert the new word
  splice @words, $i++, 0, $_;

  # remove words that are contained in new word
  for ( ; $i < @words ; $i++) {
    splice @words, $i--, 1 if 0 <= index $_, $words[$i]; # $i-- adjusts index for deletion
  }
}
print "$_\n" for @words;

0 <= index $a, $b is an efficient way of writing $a =~ /\Q$b\E/.
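A minimal sketch of that equivalence, using only core Perl. The helper names contains_index and contains_regex are made up for this illustration. Both treat the needle as a literal string, so regex metacharacters such as . are not special in either form.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this illustration only.
# Both return 1 when $needle occurs literally inside $haystack, else 0.
sub contains_index {
    my ($haystack, $needle) = @_;
    return 0 <= index($haystack, $needle) ? 1 : 0;
}

sub contains_regex {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

# Compare both forms on a few pairs, including a regex metacharacter.
for my $pair (['catfish', 'cat'], ['catfish', 'dog'], ['a.b', '.'], ['abc', '.']) {
    my ($haystack, $needle) = @$pair;
    printf "%-8s %-4s index=%d regex=%d\n", $haystack, $needle,
        contains_index($haystack, $needle),
        contains_regex($haystack, $needle);
}
```

The two columns should agree on every row; index simply skips the regex engine.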

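This is a generalization of David W.'s algorithm. If the input is sorted by decreasing word length, both implementations produce the same output.

If the words are short but there are many distinct words, it may be better to remember every possible substring. This lets us quickly detect words we have already seen, but it makes adding a word to the known list expensive.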

my %seen;  # used to detect seen words
my %words; # used to remember real words
while (<>) {
  chomp;
  next if exists $seen{$_};
  # so we didn't see it. Let's produce all substrings
  START: for (my $start = 0 ; $start < length() ; $start++) {  # include the last character as a start
    LENGTH: for (my $length = length() - $start ; $length ; $length--) {
      my $substr = substr $_, $start, $length;
      delete $words{$substr};         # if this was a real word, it's now a substring
      last LENGTH if exists $seen{$substr};  # dont repeat yourself
      $seen{$substr} = undef;         # add the entry
    }
  }
  $words{$_} = undef;  # remember this word as a real word
}
undef %seen;  # free obscene amount of memory
print "$_\n" for keys %words;
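To get a feel for the cost of remembering every substring, here is a small sketch. The sub name distinct_substrings and the sample words are made up for illustration. A word of length n has n(n+1)/2 substrings when counted by position, so a hash of seen substrings grows roughly quadratically with word length.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the distinct non-empty substrings of a word.
sub distinct_substrings {
    my ($word) = @_;
    my %subs;
    for my $start (0 .. length($word) - 1) {
        for my $len (1 .. length($word) - $start) {
            $subs{ substr($word, $start, $len) } = undef;
        }
    }
    return keys %subs;
}

# Report the upper bound n(n+1)/2 next to the actual number of distinct substrings.
for my $word (qw(cat catfish concatenate)) {
    my $n    = length $word;
    my @subs = distinct_substrings($word);
    printf "%-12s length %2d, at most %2d substrings, %2d distinct\n",
        $word, $n, $n * ($n + 1) / 2, scalar @subs;
}
```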

score 2 · Accepted Answer

This might work for you (GNU sed):

sed -r ':a;$!{N;ba};s/\b([^\n]+)\n(.*\1)/\2/;ta;s/(([^\n]+).*\n)(\2)\n?/\1/;ta' file

Slurp the file into memory, then remove individual words that are repeated elsewhere in the file, looking both forwards and backwards.

score 1 · Accepted Answer

Just for fun, here is a shell script version. I cheat by using Perl to print the line length, though.

#!/bin/sh

touch list.tmp

# Schwartzian transform: add length as prefix for each line,
perl -nle 'print length, "\t", $_' list.txt |
# reverse sort by this prefix,
sort -rn |
# and discard the prefix
cut -f2- |
while read -r line; do
     grep -q "$line" list.tmp && continue
     echo "$line" >>list.tmp
done

score 1 · Accepted Answer

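This will be quite hard because of the substring problem. Initially, I thought about sorting the list so that cat and catfish would end up next to each other, but look at this list: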

bug
bear
calf
catbug
catbear

Sorting this list won't help. Also, what about this?

concatenate
cat
bear
bug

Do I leave out cat, since it is already inside the word concatenate?

What about this one:

cat
concatenate
bear
bug

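In this case, both cat and concatenate end up in the list, because cat comes first, before concatenate. Since concatenate is not part of any word already in the list, it goes into the list.

Unless I need to check both ways: whether the word I want to add is already contained in a word in the list, and whether a word already in the list is contained in the word I'm looking at.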
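Checking both ways could be sketched as follows. This is only an illustration under assumptions: the sub name add_both_ways and the in-memory word list are invented here, and the sketch keeps surviving words in first-seen order.

```perl
use strict;
use warnings;

# Hypothetical helper: add $new to @$approved while checking both directions.
sub add_both_ways {
    my ($approved, $new) = @_;

    # Direction 1: the new word is already covered by an approved word
    # (either an exact duplicate or a substring of a longer word).
    for my $word (@$approved) {
        return if index($word, $new) >= 0;
    }

    # Direction 2: drop approved words that the new word contains.
    @$approved = grep { index($new, $_) < 0 } @$approved;

    push @$approved, $new;
    return;
}

my @approved;
add_both_ways(\@approved, $_) for qw(cat concatenate bear bug);
print "$_\n" for @approved;   # prints concatenate, bear, bug (one per line)
```

With this variant the result no longer depends on whether cat or concatenate appears first in the input, at the price of a second pass over the approved list for every accepted word.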

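This is not only a poorly defined problem, it also makes for messy code. The coding itself is fairly simple, but it ends up as an O(N²) algorithm. That means doubling the list size quadruples the processing time. If I can process 100 words in one second, I need 4 seconds for 200 words, 16 seconds for 400 words, and 64 seconds for 800 words. 1000 words would take about 100 seconds.

This uses your definition, where order matters. That is, if cat comes before catbug, both end up in the approved list, but if catbug comes before cat, cat does not make it into the list: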

#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use Data::Dumper;

use constant {
    LIST_FILE => "text.txt",
};

open my $list_fh, "<", LIST_FILE;
my @approved_list;
while ( my $new_word = <$list_fh> ) {
    chomp $new_word;
    my $new_word_in_list = 0;
    for my $word_already_in_list ( @approved_list ) {
        if ( $word_already_in_list =~ /\Q$new_word\E/ ) {
            # Word is already in the list or in a word in the list
            $new_word_in_list = 1;
            last;
        }
    }
    if ( not $new_word_in_list ) {
        push @approved_list, $new_word;
    }
}
say Dumper \@approved_list;

Musings

I realized earlier that I could use grep instead of the inner loop:

#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use Data::Dumper;

use constant {
    LIST_FILE => "text.txt",
};

open my $list_fh, "<", LIST_FILE;
my @approved_list;
while ( my $new_word = <$list_fh> ) {
    chomp $new_word;
    if ( not grep { /\Q$new_word\E/ } @approved_list ) {
        push @approved_list, $new_word;
    }
}
say Dumper \@approved_list

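The program looks shorter and appears to need only one loop, but grep hides the inner loop. For grep to work, it still has to go through every entry in the array. That is why I decided not to use grep and kept the inner loop explicit instead.

But what if I kept the words in a string instead of an array, separated by some character that I can guarantee does not occur in any word? Maybe I could then use a regular expression on the string. Would that be more efficient?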

#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use Data::Dumper;

use constant {
    LIST_FILE => "text.txt",
};

open my $list_fh, "<", LIST_FILE;
my $approved_list = "";
while ( my $new_word = <$list_fh> ) {
    chomp $new_word;
    if ( not $approved_list =~ /\Q$new_word\E/ ) {
        $approved_list = ( $approved_list ) ? "$approved_list\0$new_word" : $new_word;
    }
}
say Dumper split /\0/, $approved_list;

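In the above, I keep the approved word list in a scalar called $approved_list. I separate the words with the NUL character, on the assumption that no word contains it. Now I can search the scalar for the new word with a regular expression. If the word is not already in $approved_list, I append it, preceded by a NUL character (\0). Later I can split on NUL to get the list back.

Would using a regular expression be faster? Suppose my approved list contains 1000 words averaging 5 characters each (probably longer, since longer words are more likely to be approved than shorter ones). That is a 6000-character string that I am running a regular expression against. Is that more efficient? Hard to say.

There are three solutions:

The first uses an explicit inner loop.
The second uses grep to hide the inner loop.
The third joins the word list into a single string separated by a character that I am absolutely sure does not occur in the words. (My money is on NUL.)

The only way to tell is to run something like Benchmark::Timer on all three and see which is most efficient, and the answer may change with the list size, the words themselves, and so on.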
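A rough sketch of such a comparison, using the core Benchmark module's cmpthese rather than Benchmark::Timer (a deliberate swap, since Benchmark ships with Perl). The word list, the sub names, and the iteration count are invented for this illustration; real input would give more meaningful numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data; substitute the lines of the real file.
my @input = qw(catbug cat bear bug concatenate tree catfish fish bear);

# 1. Explicit inner loop.
sub with_loop {
    my @approved;
    WORD: for my $new (@input) {
        for my $old (@approved) {
            next WORD if $old =~ /\Q$new\E/;
        }
        push @approved, $new;
    }
    return @approved;
}

# 2. grep hides the inner loop.
sub with_grep {
    my @approved;
    for my $new (@input) {
        push @approved, $new unless grep { /\Q$new\E/ } @approved;
    }
    return @approved;
}

# 3. One NUL-separated string searched with a single regex per word.
sub with_string {
    my $approved = '';
    for my $new (@input) {
        next if $approved =~ /\Q$new\E/;
        $approved = length($approved) ? "$approved\0$new" : $new;
    }
    return split /\0/, $approved;
}

# Run each variant a fixed number of times and print a comparison table.
# A negative count would instead mean "at least that many CPU seconds".
cmpthese(20_000, {
    explicit_loop => \&with_loop,
    grep_filter   => \&with_grep,
    nul_string    => \&with_string,
});
```

All three subs return the same approved list for the same input, so any difference in the table comes from the search strategy alone. With a list this small the timings are dominated by overhead, so treat them as a template rather than a verdict.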
