regex - Regular expression returns more results than the text contains

Question

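I have a very strange problem: I am searching an HTML page for a URL and only want a specific part of that URL. In my test HTML page the link occurs only once, but instead of one result I get about 20...

This is the regex I am using: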

perl -ne 'm/http\:\/\/myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'

A sample input would look like this:

<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>

This is a very simplified example. In reality the link would appear on a normal website, surrounded by other content...

My result should look like this:

testfile.zip

But instead I see this line many times... Is this a problem with the regex or with something else?

score 5 · Accepted Answer

Yes, the regex is greedy.
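To illustrate what greedy means here, the following is a minimal sketch rather than the asker's actual data: the two-link input line is made up, and the tighter character class is only one possible way to keep the capture inside the file name.

```perl
use strict;
use warnings;

# Hypothetical input: two links on a single line.
my $line = '<a href="http://myurl.com/somefile.php?path=/afolder/one.zip">1</a> '
         . '<a href="http://myurl.com/somefile.php?path=/afolder/two.zip">2</a>';

# Greedy: (.*) takes as much as it can, so the capture runs from the
# first "/afolder/" all the way to the last ".zip" on the line.
my ($greedy) = $line =~ m{/afolder/(.*)\.(?:rar|zip|tar|gz)};

# Tighter: the capture may not contain '/', '?', '&' or '"',
# so it cannot run past the end of the first file name.
my ($tight) = $line =~ m{/afolder/([^/?&"]+)\.(?:rar|zip|tar|gz)};

print "greedy: $greedy\n";
print "tight:  $tight\n";
```

The greedy capture swallows the markup between the two links, while the restricted one stops at the first file name.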

Use a proper HTML tool instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then use URI to extract the specific part.

use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();

my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
    my $u = URI->new($link->url);
    # 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
    say $u->query_param('path');
    # '/foo/bar/afolder/testfile.zip'
    $u = URI->new($u->query_param('path'));
    say (($u->path_segments)[-1]);
    # 'testfile.zip'
}

score 1 · Accepted Answer

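Are there 20 lines in the file after the link?

Your problem is that the match variables are not reset. The first time your link matches, $1 and $2 get their values. On the following lines the regex does not match, but $1 and $2 still hold the old values, so you should print only when the regex matches, not on every line.

From perlre, see the section on capture groups:

NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.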
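The behaviour described above, and the suggested fix of printing only on a successful match, can be sketched as follows. This is an illustration under assumptions: archive_name is a hypothetical helper, and the three input lines are a made-up split of the sample page from the question.

```perl
use strict;
use warnings;

# A successful match sets $1 and $2 ...
'testfile.zip' =~ /(\w+)\.(zip)/;
# ... and a later failed match in the same scope leaves them untouched.
my $second_matched = 'no archive here' =~ /(\w+)\.(zip)/;
my $stale = "$1.$2";    # still built from the first match

# Hypothetical helper: return "name.ext" only when the line really matches.
sub archive_name {
    my ($line) = @_;
    return "$1.$2"
        if $line =~ m{http://myurl\.com/somefile\.php.+/afolder/(.*)\.(rar|zip|tar|gz)};
    return;    # nothing for a non-matching line
}

my @lines = (
    '<html><body>',
    '<a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a>',
    '</body></html>',
);

# Non-matching lines contribute nothing to the result list.
my @found = map { archive_name($_) } @lines;
print "$_\n" for @found;
```

For the one-liner in the question, the same idea amounts to making the print conditional on the match, for example by writing the statement as print "$1.$2\n" if m/.../; with the original pattern in place of the dots.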

score -2 · Accepted Answer

This should do the trick for your sample input and output.

my $Str = '<html><body><a href="http://myurl.com/somefile.php&x=foo?y=bla?z=sdf?path=/foo/bar/afolder/testfile.zip?more=arguments?and=evenmore">Somelinknme</a></body></html>';

# In list context with /g, the match returns every captured file name.
my @Matches = ($Str =~ m#path=.+/(\w+\.\w+)#g);
print "@Matches\n";
