perl - Finding duplicated words

Question

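As an exercise, I have to write a Perl program that checks whether a text file contains repeated words and then prints the text to a new file without the duplicates.

Can someone help me? I know that the m// operator can be used to find words, but how do I find words that I may not know in advance? For example, if the text file contains:

Hello, Hello, how are you?

I would want to copy this file to a new file without one of the "Hello"s. Of course, I won't know in advance whether the file contains any repeated words... that is the whole idea: the program has to search for the duplicates itself.

I have a basic script that sorts the lines alphabetically, but step 2, finding the duplicated words... I have no idea. Here is the script (hopefully correct so far):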

#!/usr/bin/perl 
use strict;
use warnings;

my $source = shift(@ARGV);
my $cible = shift(@ARGV);

open (SOURCE, '<', $source) or die ("Can't open $source\n");
open (CIBLE, '>', $cible) or die ("Can't open $cible\n");

my @lignes = <SOURCE>;
my @lignes_sorted = sort (@lignes);

print CIBLE @lignes_sorted;

chomp @lignes;
chomp @lignes_sorted;

print "Original text : @lignes\n";

sleep (1);

print "Sorted text : @lignes_sorted\n"; 

close(SOURCE);
close (CIBLE);

score 1 · Accepted Answer

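Removing duplicate words from a sentence is more complicated than it sounds. For example, if you split the sentence on whitespace, you get "words" such as Hello, that contain non-word characters and do not count as duplicates of the real word Hello. There are many variables to take into account, but assuming the simplest case, in which every run of non-whitespace characters is a legitimate word, you can do this: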

$ perl -anlwe '@F=grep !$seen{$_}++, @F; print "@F";' hello.txt
Hello, how are you?
yada Yada this is test material dupe Dupe

$ cat hello.txt
Hello, Hello, how are you?
yada Yada this is test material dupe dupe Dupe

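As you can see, it does not consider yada and Yada to be duplicates. Nor would it consider Hello a duplicate of Hello,. You can tweak this by using lc or uc to remove the case dependency, and by allowing delimiters other than just whitespace.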
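One way to make that adjustment, as a rough sketch rather than a definitive solution, is to compare words by a normalized key instead of by the raw token. The helper name dedupe_line and the normalization rule (lowercase, then strip leading and trailing non-word characters) are assumptions made for this illustration; the first spelling encountered is the one that is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: remove words already seen, comparing them
# case-insensitively and ignoring surrounding punctuation.
sub dedupe_line {
    my ($line, $seen) = @_;
    my @kept = grep {
        my $key = lc $_;              # case-insensitive comparison
        $key =~ s/^\W+|\W+$//g;       # strip leading/trailing punctuation
        $key = $_ if $key eq '';      # keep punctuation-only tokens distinct
        !$seen->{$key}++;             # true only the first time a key is seen
    } split ' ', $line;
    return join ' ', @kept;
}

my %seen;
print dedupe_line('Hello, Hello how are you?', \%seen), "\n";
print dedupe_line('yada Yada this is test material dupe dupe Dupe', \%seen), "\n";
```

With this key, Hello is treated as a repeat of Hello, and Yada as a repeat of yada. Because %seen is shared between calls, a word is also dropped if it already appeared on an earlier line, just as in the one-liner.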

What the one-liner does is use a hash, %seen, to keep track of the words that have appeared before. The basic program is:

my %seen;
while (<>) {                      # read the input file(s) or stdin
    my @F = split;                # split $_ on whitespace by default
    @F = grep !$seen{$_}++, @F;   # remove duplicates
    print "@F\n";                 # print array elements space-separated
}

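What !$seen{$_}++ does is return true the first time a new key is seen and false every time after that. How does it work? These are the separate steps that take place: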

$seen{$_}     # value for key $_ is fetched
$seen{$_}++   # value for key $_ is incremented, undef -> 1
              # $foo++ returns the value *before* it is incremented, 
              # so it returns undef
!$seen{$_}++  # this is now "! undef", meaning "not false", as in true.

Values of 1 and above are all true, so the not operator negates every one of them to false.
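As a rough illustration of those steps, the following sketch runs the same expression three times on one key and records what it sees each time. The names @results, $before and $is_new are made up for this example, and the // (defined-or) operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my %seen;
my @results;
for my $word (qw(dupe dupe dupe)) {
    my $before = $seen{$word} // 'undef';              # value before the increment
    my $is_new = !$seen{$word}++ ? 'true' : 'false';   # the test used by grep
    push @results, "$word: before=$before, first occurrence=$is_new";
}
print "$_\n" for @results;
# dupe: before=undef, first occurrence=true
# dupe: before=1, first occurrence=false
# dupe: before=2, first occurrence=false
```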

score 0 · Accepted Answer

In Perl:

#!/usr/bin/perl -w
use strict;

my $source = shift(@ARGV);
my $cible = shift(@ARGV);

open (SOURCE, '<', $source) or die ("Can't open $source\n");
open (CIBLE, '>', $cible) or die ("Can't open $cible\n");

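my @input = sort <SOURCE>;   # read all lines and sort them, as in the question's script
my %words = ();              # words that have already been printed
foreach (@input) {
    foreach my $word (split ' ') {   # split $_ on runs of whitespace
        print CIBLE $word." " unless ( exists $words{$word} );
        $words{$word} = 1;
    }
}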

close(SOURCE);
close (CIBLE);

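The basic idea is to split the whole text into individual words (using the split function) and then build a hash with each word as a key. When reading the next word, simply check whether that word is already in the hash. If it is, it is a duplicate.

For the string Hello, Hello, how are you? it prints: Hello, how are you?.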

score 0 · Accepted Answer

If you are not worried about finding duplicate words that differ in case, you can do this with a single substitution.

use strict;
use warnings;

my ($source, $cible) = @ARGV;

my $data;
{
    open (my $source_fh, '<', $source) or die ("Can't open $source\n");
    local $/;
    $data = <$source_fh>;
}

$data =~ s/\b(\w+)\W+(?=\1\b)//g;

open (my $cible_fh, '>', $cible) or die ("Can't open $cible\n");
print $cible_fh $data;
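
To see what the substitution does and does not catch, here is a small self-contained sketch that applies it to an in-memory string instead of a file; the sample text and the variable name $deduped are made up for the illustration. Note that the lookahead only compares a word with the word immediately following it, so only consecutive repeats are removed.

```perl
use strict;
use warnings;

my $data = "Hello, Hello, how are you?\nthis is is a test, a test\n";

# Delete a word plus the non-word characters after it whenever the very
# next word is identical (the lookahead checks without consuming it).
(my $deduped = $data) =~ s/\b(\w+)\W+(?=\1\b)//g;

print $deduped;
# Hello, how are you?
# this is a test, a test
```

The second "a test" survives because its repeats are not adjacent; catching those would need the hash-based approach from the other answers.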

score -1 · Accepted Answer

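I don't know how to do this in Perl, but it can easily be done with sed and a few Unix utilities. The algorithm would be:

Separate all the words by replacing whitespace with newlines
Sort the words
Send the sorted word list through uniq with the -c option (which counts the words)
Remove all words that occur only once (a count of 1 in the first column)

The command would be run as follows (replace \t with TAB and \n with ENTER):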

sed 's/[ \t,.][ \t,.]*/\n/g' filename | sort | uniq -c | sed '/^  *\<1\>/d'

Hope that helps.
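
For comparison, the same steps can be sketched in Perl with a counting hash. This is only an illustration under stated assumptions: the helper name repeated_words is made up, and the delimiter set (space, tab, newline, comma, period) is chosen to mirror the sed command above.

```perl
use strict;
use warnings;

# Hypothetical helper: count every word and report those seen more than once,
# in sorted order, as "count word" strings (similar to uniq -c output).
sub repeated_words {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { length } split /[ \t\n,.]+/, $text;

    my @report;
    for my $word (sort keys %count) {
        push @report, "$count{$word} $word" if $count{$word} > 1;
    }
    return @report;
}

print "$_\n" for repeated_words("Hello, Hello, how are you?\ndupe dupe Dupe\n");
# 2 Hello
# 2 dupe
```

Like the pipeline, this lists the repeated words rather than writing a cleaned copy of the text, and it treats dupe and Dupe as different words.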
