regex - 有没有一种快速的方法来获取正则表达式匹配的字符索引？

Question

我正在创建一个 Perl 脚本，它必须处理数百万维基百科文章的标记 - 所以速度是一个问题。

我正在寻找的一件事是模板的出现，它总是看起来像这样：{{template}}。因为这些可能很复杂和嵌套，所以我需要分别找到开始和结束标记，并知道找到它们的字符索引。

所以这里有一些简单的代码（假设 $text 是包含模板的文本）：

my $matchIndex ;

my $startCount = 0 ;
my $endCount = 0 ;

# find all occurrences of template start and template end tags
while($text =~ m/(\{\{)|(\}\})/gs) {

    $matchIndex = $+[0] ;

    if (defined $1) {
        #this is the start of a template
        $startCount ++ ;
    } else {
        #this is the end of a template
        $endCount++ ;
    }
 }

这段代码真正奇怪的是，这$matchIndex = $+[0] ;行代码对效率有很大的影响，尽管它只是在数组中查找一个值。如果没有注释掉，一篇复杂的 Wikipedia 文章（包含 2000 个模板 - 疯狂但确实发生了）在 0m0.080 秒内处理。将其保持在那里会使时间增加到 0m2.646s。我勒个去？

也许这听起来像我在分裂头发，但这是在几个小时内处理维基百科或在几周内处理它之间的区别。

score 5 · Accepted Answer

你为什么使用正则表达式？您正在寻找文字文本 {{ 或 }} 的位置。Perl 有一个内置函数可以做到这一点：index。

由于您正在尝试解析 Wikipedia 条目，因此您需要处理嵌套的模板指令。这意味着，例如，您找到的第二组闭合卷曲不一定与第二组开放卷曲一起使用。在Perl条目的这一点中，第一个结束卷曲与第二个开始卷曲一起出现：

{{信息框编程语言
| latest_release_version = 5.10.0
| latest_release_date = {{发布日期|mf=yes|2007|12|18}}
| 图灵完备 = 是
}}

Perl 5.10 正则表达式可以为您处理这个问题，因为它们可以递归地匹配平衡文本，并且还有 Perl 模块可以做到这一点。不过，这将是一些工作。在你说出你想要完成的事情之前，很难给你任何建议。肯定有一个 mediawiki 解析器可以做你想做的事情。

我打算编写我的index()解决方案，但我没有。我不能让你的代码足够慢以至于它很重要。即使我完成了所有堆栈管理并打印了每个模板的内容，这两个解决方案对我来说几乎都是即时完成pos()的。@-我必须非常努力地让它运行得足够慢才能被测量，而且我在一些旧硬件上。您可能需要以其他方式调整您的应用程序。

你确定你正在测量的代码在你认为的那一点上变慢了吗？您是否使用Devel::NYTProf 对其进行了概要分析，以了解您的实际程序在做什么？

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark;

my $text = do { local $/; <DATA> }; # put the contents after __END__

my %subs = (
    using_pos     => sub {
        my $page = shift;

        my @stack;
        my $found;
        while( $$page =~ m/ ( \{\{ | }} ) /xg ) {           
            if( $1 eq '{{' ) { push @stack, pos($$page) - 2; }
            else             
                { 
                my $start = pop @stack;
                print STDERR "\tFound at $start: ", substr( $$page, $start, pos($$page) - $start ), "\n";
                $found++;
                };
            }

        print " Processed $found templates => ";
        },

    using_special => sub {
        my $page = shift;

        my @stack;
        my $found;
        while( $$page =~ m/ ( \{\{ | }} ) /xg ) {           
            if( $1 eq '{{' ) { push @stack, $-[0]; }
            else             
                { 
                my $start = pop @stack;
                print STDERR "\tFound at $start: ", substr( $$page, $start, $-[0] - $start ), "\n";
                $found++;
                };
            }

        print " Processed $found templates => ";
        },

    );

foreach my $key ( keys %subs )
    {
    printf "%15s => ", $key;

    my $t = timeit( 1, sub{ $subs{$key}->( \$text ) } );
    print timestr($t), "\n";
    }

我的 17" MacBook Pro 上的 perl：

macbookpro_brian[349]$ perl -V
我的perl5（revision 5 version 8 subversion 8）配置总结：
  平台：
    osname=darwin，osvers=8.8.2，archname=darwin-2level
    uname='darwin macbookpro.local 8.8.2 darwin kernel version 8.8.2: thu sep 28 20:43:26 pdt 2006; 根：xnu-792.14.14.obj~1release_i386 i386 i386 '
    config_args='-des'
    提示=推荐，useposix=true，d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define useocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    使用mymalloc=n，bincompat5005=undef
  编译器：
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/包括'，
    优化='-O3',
    cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/opt/local/包括'
    ccversion='', gccversion='4.0.1 (Apple Computer, Inc. build 5363)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define，longlongsize=8，d_longdbl=define，longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8，原型=定义
  链接器和库：
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -L/usr/local/lib -L/opt/local/lib'
    libpth=/usr/local/lib /opt/local/lib /usr/lib
    库=-ldbm -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/usr/lib/libc.dylib，so=dylib，useshrplib=false，libperl=libperl.a
    gnulibc_version=''
  动态链接：
    dlsrc=dl_dlopen.xs，dlext=bundle，d_dlsymun=undef，ccdlflags=''
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib'


此二进制文件的特征（来自 libperl）：
  编译时选项：PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
  建于达尔文之下
  编译于 2007 年 4 月 9 日 10:36:26
  @INC：
    /usr/local/lib/perl5/5.8.8/darwin-2level
    /usr/local/lib/perl5/5.8.8
    /usr/local/lib/perl5/site_perl/5.8.8/darwin-2level
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl

score 4 · Accepted Answer

更新：

你的时间有点可疑：

#!/usr/bin/perl

use strict;
use warnings;

my $text = '{{abcdefg}}' x 100_000;

my @match_pos;
my ($start_count, $end_count);

while ( $text =~ /({{)|(}})/g ) {
    push @match_pos, $-[0];
    if ( defined $1 ) {
        ++$start_count;
    }
    else {
        ++$end_count;
    }
}

让我们计时：

C:\Temp> timethis zxc.pl

TimeThis：命令行：zxc.pl
TimeThis：经过的时间：00:00:00.985

替换$-[0]需要length $`太长时间才能完成（我CTRL-C在一分钟后按下）。

如果我2_000复制上面的简单模式，时间最终是相同的（大约 0.2 秒）。因此，我建议使用$-[0]可扩展性。

以前的讨论

来自perldoc perlvar：

# @LAST_MATCH_START
# @-

$-[0]是最后一次成功匹配开始的偏移量。 $-[n]是与第 n 个子模式匹配的子字符串开头的偏移量，如果子模式不匹配，则为 undef。

另请参阅@+。

正则表达式中的s选项是不必要的，因为.模式中没有。

你看过Text::Balanced吗？

您也可以使用pos，尽管我不确定它是否能满足您的性能要求。

#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;

my $text = read_file \*DATA;

my @match_pos;
my ($start_count, $end_count);

while ( $text =~ /({{)|(}})/g ) {
    push @match_pos, pos($text) - 2;
    # push @match_pos, $-[0]; # seems to be slightly faster
    if ( defined $1 ) {
        ++$start_count;
    }
    else {
        ++$end_count;
    }
}

for my $i ( @match_pos ) {
    print substr($text, $i, 2), "\n";
}

__DATA__
Copy & paste the source of the complicated Wikipedia page here to test.

score 3 · Accepted Answer

$+[0]不仅仅是一个数组查找；它使用魔术接口深入研究正则表达式结果结构以查找所需的值。但我很难相信 2000 次迭代需要 2 秒。你能发布一个实际的基准吗？

您是否按照 Sinan Ünür 的建议尝试使用 pos？

更新：在我看来，字节偏移量和字符偏移量（应该被有效缓存）之间的转换可能会降低您的性能。最初尝试在您的字符串上运行 utf8::encode()，然后在需要时对捕获的单个文本片段运行 utf8::decode。

score 0 · Accepted Answer

除非您在 Wikipedia 服务器上运行它，否则网络延迟将比对脚本的调整重要一个数量级，即使那样它也将是微不足道的。

MediaWiki API和CPAN JSON 模块可能对您更有用，当然这取决于您要做什么。

regex - 有没有一种快速的方法来获取正则表达式匹配的字符索引？

4 回答 4

更新：

以前的讨论

Related

Reference