regex - Perl regular expression for SpamAssassin to exclude certain TLDs

Question

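I can't write Perl at all, so something that looks simple, writing a regular expression that scores every URI whose TLD is not "com", "net", or "org", is apparently beyond my skills. Can someone enlighten me?

As an example, I want https://foo.com.us/asdf?qwerty=123 to match and ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2 not to match.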

score 2 · Accepted Answer

The regex pattern

//(?:[a-z]+\.)*+(?!com/|net/|org/)

should do what you want. The slashes are part of the pattern, not delimiters.

Here is a demonstration

use strict;
use warnings;
use 5.010;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

for ( @urls ) {
    say m{//(?:[a-z]+\.)*+(?!com/|net/|org/)} ? 'match' : 'no match';
}

Output

match
no match
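
The pattern above assumes lower-case host names made only of letters, with a slash right after the TLD. As written it would, for instance, also flag http://example.org (no trailing slash) and http://a1.example.net/ (a digit in a label). A stricter variant might look like the sketch below. The variable name $non_cno and the extra test URLs are made up for illustration, and the pattern has not been tried inside SpamAssassin itself.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical stricter pattern: matches URIs whose host does NOT end in
# com, net or org. Illustrative only, not a tested SpamAssassin rule.
my $non_cno = qr{
    \A [a-z][a-z0-9+.-]* :      # scheme, anchored at the start of the URI
    //                          # start of the authority part
    (?: [^/?\#\@]* \@ )?+       # optional userinfo, matched possessively
    (?!                         # fail if the host ends in com, net or org
        (?: [a-z0-9-]+ \. )*    #   any leading labels
        (?: com | net | org )   #   one of the three allowed TLDs
        \.?                     #   optional trailing root dot
        (?! [a-z0-9.-] )        #   and the host name ends right here
    )
}xi;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
    http://example.org
    http://example.org:8080/x
    http://a1.example.net/
    HTTP://EXAMPLE.COM/
    http://example.com@evil.biz/
};

for my $url ( @urls ) {
    say "$url: ", ( $url =~ $non_cno ? 'match' : 'no match' );
}
```

With this sketch only the first and the last URL should be reported as matches. The possessive `?+` on the userinfo group matters: without it the engine could backtrack, treat `user@` as part of the host, and report a false match.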

score 0 · Accepted Answer

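You should use the URI module to separate the host name from the rest of the URL

This example extracts only the last label of the host name, so it will look at, for example, uk from bbc.co.uk, but it should suit your purpose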

use strict;
use warnings;

use URI;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

for my $url ( @urls ) {
    $url = URI->new($url);
    my $host = $url->host;
    my ($tld) = $host =~ /([^.]+)\z/;

    if ( $tld !~ /^(?:com|net|org)\z/ ) {
        # non-standard TLD
    }
}
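
The loop above leaves the action as a comment. One way to package the same idea is sketched below. The helper name has_unlisted_tld is hypothetical (it is not part of the URI module), and the guards for host-less schemes, upper-case hosts, and a trailing dot are assumptions about what the input might contain. URI is a CPAN module, not part of core Perl, so it has to be installed.

```perl
use strict;
use warnings;

use URI;

# Hypothetical helper (not part of the URI module): returns 1 when the URL
# has a host whose last label is something other than com, net or org.
sub has_unlisted_tld {
    my ($url) = @_;
    my $uri = URI->new($url);

    # Schemes such as mailto: have no host method, so check before calling it.
    return 0 unless $uri->can('host');
    my $host = $uri->host;
    return 0 unless defined $host && length $host;

    # Lower-case the host and take the last label, ignoring a trailing dot.
    my ($tld) = lc($host) =~ /([^.]+)\.?\z/;
    return 0 unless defined $tld;

    return $tld !~ /\A(?:com|net|org)\z/ ? 1 : 0;
}

for my $url (qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
}) {
    printf "%-55s %s\n", $url, has_unlisted_tld($url) ? 'flag' : 'ok';
}
```

Letting URI do the parsing keeps the TLD check to a single small regex, at the cost of a module dependency that a plain SpamAssassin uri rule would not have.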
