html - Parsing domains from an HTML page using Perl

Question

I have an HTML page that contains the following URLs:

<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">

I want to extract:

site.com/path
www.site.org

That is, the text between <h3><a href=" (plus the URL scheme) and /index.php.

I tried this code:

#!/usr/local/bin/perl
use strict;
use warnings;

open (MYFILE, 'MyFileName.txt');
while (<MYFILE>) 
{
  my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
  my @values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content

    print $values2[0]; # here it must print www.site.org/path/ but it don't
    print "\n";
}
close (MYFILE);

But this gives an output:

It also does not parse https sites. I hope you understand. Regards.

score 2 · Accepted Answer

The main problem with your code is that when you call split in scalar context, as in your line:

my $values1 = split('http://', $_);

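it returns the number of fields produced by split, not the fields themselves. See the documentation for split (perldoc -f split).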
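
To make the difference concrete, here is a minimal sketch contrasting the two contexts. The sample line is taken from the question; the variable names $line, $count and @fields are only illustrative.

```perl
use strict;
use warnings;

# Sample line from the question.
my $line = '<h3><a href="http://site.com/path/index.php" h="blablabla">';

# Scalar context: split returns the NUMBER of fields it produced.
my $count = split('http://', $line);

# List context: split returns the fields themselves.
my @fields = split('http://', $line);

print "scalar context: $count\n";
print "list context:   $fields[1]\n";
```

On a reasonably modern perl this should print 2 for the scalar case, and the text after http:// for the list case, which is why the original script never sees the URL at all.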

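But I don't think split is the right tool for this task anyway. If you know that the value you are looking for will always sit between 'http[s]://' and '/index.php', then all you need is a regex substitution inside your loop (you should also be more careful about how you open the file...):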

open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
    s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}

close($myfile_fh);

You may need a more general regex than this, but I think it will work given your description of the problem.

score 1 · Accepted Answer

This looks to me like a job for a module.

Using regular expressions to parse HTML is generally risky.

score 0 · Accepted Answer

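dms has explained in his answer why using split is not the best solution here:

It returns the number of items in scalar context.
A plain regex is better suited to this task.

However, I don't think that line-based input processing is valid for HTML, or that using a substitution makes sense here (it doesn't, especially when the pattern looks like .*Pattern.*).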

Given a URL, we can extract the required information like this:

if ($url =~ m{^https?://(.+?)/index\.php}s) {  # domain+path now in $1
  say $1;
}
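
As a quick check, the same pattern can be applied to the two URLs from the question. This is only a sketch: the helper name domain_and_path is hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the host (plus any path) that precedes
# /index.php, or undef when the URL does not match the pattern.
sub domain_and_path {
    my ($url) = @_;
    return $url =~ m{^https?://(.+?)/index\.php}s ? $1 : undef;
}

# The two URLs from the question.
my @urls = (
    'http://site.com/path/index.php',
    'https://www.site.org/index.php?option=com_content',
);

for my $url (@urls) {
    my $found = domain_and_path($url);
    print defined $found ? "$found\n" : "no match: $url\n";
}
```

For these two inputs it should print site.com/path and www.site.org, which is the output the question asks for. Both http and https are covered by the optional s in the pattern.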

But how do we extract the URLs? I would recommend the wonderful Mojolicious suite.

use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp';  # makes it easy to read files.
use Mojo::DOM;  # load the DOM parser explicitly; a bare "use Mojo" may not pull it in

my $html_file = shift @ARGV;  # take file name from command line

my $dom = Mojo::DOM->new(scalar slurp $html_file);

for my $link ($dom->find('a[href]')->each) {
  say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}

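The find method accepts a CSS selector (here: all a elements that have an href attribute). The each method flattens the result set into a list that we can loop over.

Since I print to STDOUT, we can use shell redirection to put the output into the desired file, e.g.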

$ perl the-script.pl html-with-links.html >only-links.txt

The whole script as a one-liner:

$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'
