regex - How can I use a regular expression to get the hostname from a URL in Perl?

Question

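What I want to do is remove everything after, and including, the first "/" that appears after a ".". So: http://linux.pacific.net.au/primary.xml.gz would become: http://linux.pacific.net.au

How can I do this with a regular expression? The system I'm running on can't use the URI tool.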

score 6 · Accepted Answer

my $url = 'http://linux.pacific.net.au/primary.xml.gz';
# Capture the scheme plus everything up to the first ':' or '/' after it
my ($domain) = $url =~ m!(https?://[^:/]+)!;
print "$domain\n";

Output:

http://linux.pacific.net.au

This is the official regular expression for decoding a URI (adapted from RFC 3986, Appendix B):

  my($scheme, $authority, $path, $query, $fragment) =
  $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
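
As a rough sketch of how that expression could be applied to the URL from the question: the helper name scheme_and_authority is made up for illustration, and it assumes you only want the scheme and authority joined back together. Note that the authority part also carries any port or userinfo, not just the hostname.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): returns "scheme://authority",
# or nothing if the string has no scheme or no authority.
sub scheme_and_authority {
    my ($uri) = @_;

    # Same expression as above; only the first two captures are kept.
    my ($scheme, $authority) =
        $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

    return unless defined $scheme && defined $authority;
    return "$scheme://$authority";
}

print scheme_and_authority('http://linux.pacific.net.au/primary.xml.gz'), "\n";
```

For the URL in the question this should print http://linux.pacific.net.au.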

score 5 · Accepted Answer

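I suggest you use URI::Split, which splits a standard URL into its component parts and can join them back together. You need the first two parts: the scheme and the host.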

use strict;
use warnings;

use URI::Split qw/ uri_split uri_join /;

my $scheme_host = do {
  my (@parts) = uri_split 'http://linux.pacific.net.au/primary.xml.gz';
  uri_join @parts[0,1];
};

print $scheme_host;

Output

http://linux.pacific.net.au

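Update

If your comment "the system I'm running on can't use the URI tool" means that you can't install modules, then here is a regular expression solution.

You say you want to remove everything after, and including, the first "/" that appears after a ".", so /^.*?\./ finds the first dot, and then m|[^/]+| matches everything up to the next slash.

The output is the same as that of the previous code.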

use strict;
use warnings;

my $url = 'http://linux.pacific.net.au/primary.xml.gz';

my ($scheme_host) = $url =~ m|^( .*?\. [^/]+ )|x;

print $scheme_host;
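
To see where this pattern and the https?://[^:/]+ pattern shown earlier in this thread differ, here is a small sketch that runs both on the question's URL and on a variant with a port. The :8080 URL and both sub names are my own additions for illustration, not something from the question.

```perl
use strict;
use warnings;

# Earlier pattern: stops at the first ':' or '/' after the scheme,
# so a port number is dropped.
sub host_without_port {
    my ($url) = @_;
    my ($match) = $url =~ m!(https?://[^:/]+)!;
    return $match;
}

# Pattern above: everything up to the first '/' after the first dot,
# so a port number is kept.
sub up_to_first_slash_after_dot {
    my ($url) = @_;
    my ($match) = $url =~ m|^( .*?\. [^/]+ )|x;
    return $match;
}

for my $url ('http://linux.pacific.net.au/primary.xml.gz',
             'http://linux.pacific.net.au:8080/primary.xml.gz') {
    print host_without_port($url), "\n";
    print up_to_first_slash_after_dot($url), "\n";
}
```

Both patterns agree on the original URL. With the port added, the first drops :8080 and the second keeps it. The dot-based pattern also relies on the host containing a dot, so a URL like http://localhost/x would not match at all.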

score 4 · Accepted Answer

The system I'm running on can't use URI tool.

I really recommend doing whatever you can to fix that problem first. If you're not able to use CPAN modules then you'll be missing out on a lot of the power of Perl and your Perl programming life will be far more frustrating than it needs to be.
