regex - 正则表达式从字典哈希（Perl）中简单替换文档

Question

我需要尽快从大型文档中的哈希中查找和替换关键字。我厌倦了以下两种方法，一种快 320%，但我确信我这样做是错误的，并且肯定有更好的方法来做到这一点。

我只想替换字典哈希中存在的关键字并保留那些不存在的关键字，这样我就知道它不在字典中。

下面的两种方法都扫描两次以查找和替换我的想法。我相信像向前或向后这样的正则表达式可以更快地优化它。

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(:all);

my %dictionary = (
            pollack => "pollard",
            polynya => "polyoma",
            pomaces => "pomaded",
            pomades => "pomatum",
            practic => "praetor",
            prairie => "praised",
            praiser => "praises",
            prajnas => "praline",
            quakily => "quaking",
            qualify => "quality",
            quamash => "quangos",
            quantal => "quanted", 
            quantic => "quantum",
    );

my $content =qq{
        Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
        can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
        Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
        one kind of problem {qualify} {doesNotExist} end.
    };

# just duplicate content many times
$content .= $content;

cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});

print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;

sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}

这是基准测试结果：

              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --

score 3 · Accepted Answer

它有助于构建一个匹配任何哈希键的正则表达式。像这样

my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;

sub replacer4 {
    my ($string) = @_;
    $string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
    $string;
}

有了这些结果

              Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2  4883/s         --       -80%       -84%       -85%
replacer_1 24877/s       409%         --       -18%       -22%
replacer_3 30385/s       522%        22%         --        -4%
replacer_4 31792/s       551%        28%         5%         --

如果您可以在散列中使用大括号和方括号，而不是每次都添加它们，它也会有所改进。

score 3 · Accepted Answer

这是一种更快、更紧凑的方法：

sub replacer3 {
    my ($content) = shift;
    $content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
    return $content;
}

在 Perl 5.8 中，如果您的字典值都不是“假” ，则可以使用它来||代替。//

使用已经包含大括号和方括号的字典还有一些好处：

sub replacer5 {
    my ($content) = shift;
    our %dict2;
    if (!%dict2) {
        %dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
    }
    $content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
    return $content;
}

基准测试结果：

              Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2  2908/s         --       -79%       -83%       -84%
replacer_1 14059/s       383%         --       -20%       -25%
replacer_3 17513/s       502%        25%         --        -7%
replacer_5 18741/s       544%        33%         7%         --

score 2 · Accepted Answer

我建议为您的基准测试子例程使用有意义的名称，它会使输出和意图更加清晰。

下面复制了一些鲍罗丁和暴徒尝试过的东西，然后也将它们结合起来。

#!/usr/bin/perl

use strict;
use warnings;
use feature 'state';

use Benchmark qw(:all);

# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};

# Quadruple Content
$content = $content x 4;

cmpthese(100_000, {
    original        => sub { my $text = original($content) },
    build_list      => sub { my $text = build_list($content) },
    xor_regex       => sub { my $text = xor_regex($content) },
    list_and_xor    => sub { my $text = list_and_xor($content) },
});

exit;

sub original {
    my $content = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub build_list {
    my $content = shift;
    state $list = join '|', map quotemeta, keys %dictionary;
    $content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
    return $content;
}

sub xor_regex {
    my $content = shift;

    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

sub list_and_xor {
    my $content = shift;

    state $list = join '|', map quotemeta, keys %dictionary;
    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted 
quantic quantum

Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.

输出：

                Rate     original    xor_regex   build_list list_and_xor
original     19120/s           --         -23%         -24%         -29%
xor_regex    24938/s          30%           --          -1%          -8%
build_list   25253/s          32%           1%           --          -7%
list_and_xor 27027/s          41%           8%           7%           --

我的解决方案大量使用state变量来避免重新初始化静态数据结构。但是，也可以使用闭包或our $var; $var ||= VAL.

关于增强正则表达式的 LHS 的附录

实际上，编辑 LHS 以使用显式列表是关于改进正则表达式。而这一变化表明速度提高了 30%。

对此不可能有任何神奇的解决方案。您有一个要替换的值列表。好像没有什么神秘的方法可以简化这个目标的语言。

如果字典哈希中不存在该单词，您也许可以使用 LHS 中的代码块来失败并跳过。但是，以下显示这实际上比您的原始方法慢 36%：

sub skip_fail {
    my $content = shift;

    $content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;

    return $content;
}

输出：

                Rate   skip_fail    original   xor_regex build_list list_and_xor
skip_fail     6769/s          --        -36%        -46%       -49%         -53%
original     10562/s         56%          --        -16%       -21%         -27%
xor_regex    12544/s         85%         19%          --        -6%         -14%
build_list   13355/s         97%         26%          6%         --          -8%
list_and_xor 14537/s        115%         38%         16%         9%           --

regex - 正则表达式从字典哈希（Perl）中简单替换文档

3 回答 3

关于增强正则表达式的 LHS 的附录

Related

Reference