perl - Extracting a pattern from an HTML file with Perl

Question

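I have an .html file full of links, and I want to extract the domains without the http:// (so just the hostname part of each link, e.g. blah.com), list them, and remove duplicates.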

This is what I have come up with so far. I think the problem is in the way I am trying to pass the $tree data:

#!/usr/local/bin/perl -w

use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI;
  foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    my $u1 = URI->new($tree);
    print "host: ", $u1->host, "\n";
    print "Hey, here's a dump of the parse tree of $file_name:\n";

    # Now that we're done with it, we must destroy it.
    # $tree = $tree->delete; # Not required with weak references
  }

score 4 · Accepted Answer

Personally, I would use Mojo::DOM for this, together with the URI module to extract the domains:

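  use strict;
  use warnings;
  use Mojo::DOM;
  use URI;
  use List::AllUtils qw/uniq/;

  # Slurp the HTML from the file named on the command line (or from STDIN).
  my $html_code = do { local $/; <> };

  # authority() dies for schemes that have none (e.g. mailto:);
  # eval traps that and "// ()" drops the entry from the list.
  my @domains = sort +uniq
    map eval { URI->new( $_->{href} )->authority } // (),
        Mojo::DOM->new( $html_code )->find("a[href]")->each;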

(PS: the exception handling around ->authority is there because some URIs will croak here, e.g. mailto: links.)
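If you would rather stay with HTML::TreeBuilder from the question, here is a minimal sketch under a few assumptions. URI->new expects a single URL string, not a parse tree, so the links have to be pulled out of the tree first. The helper name hosts_from_html and the sample HTML are made up for illustration. It calls ->host rather than ->authority, which leaves out any userinfo or port, and it uses the same eval guard for URIs that have no host (mailto:, relative links).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical helper (not from the original post): return the sorted,
# de-duplicated hostnames found in the href attributes of <a> elements.
sub hosts_from_html {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %seen;
    for my $link ( $tree->look_down( _tag => 'a' ) ) {
        my $href = $link->attr('href');
        next unless defined $href;

        # host() is not available for every kind of URI (e.g. mailto:),
        # so trap the failure instead of letting the script die.
        my $host = eval { URI->new($href)->host };
        next unless defined $host && length $host;

        $seen{ lc $host } = 1;    # hostnames are case-insensitive
    }
    $tree->delete;    # free the tree explicitly (needed without -weak)

    return sort keys %seen;
}

# Made-up sample input, for illustration only.
my $html = <<'HTML';
<a href="http://blah.com/one">one</a>
<a href="http://blah.com/two">two</a>
<a href="https://www.example.org/">three</a>
<a href="mailto:someone@example.org">mail</a>
<a href="/relative/path">relative</a>
HTML

print "$_\n" for hosts_from_html($html);
```

To run it against real files, replace the sample string with the slurped contents of each file in @ARGV.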

score 2 · Accepted Answer

Here's another option:

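use strict;
use warnings;
use Regexp::Common qw/URI/;
use URI;

my %hosts;    # host name => number of times it was seen

# Read every line of the files named on the command line (or STDIN).
while (<>) {
    # With -keep, $1 holds the whole matched URI.
    while (/$RE{URI}{-keep}/g) {
        # host() may be unavailable for non-server schemes, so trap failures.
        my $host = eval { URI->new($1)->host };
        $hosts{$host}++ if defined $host;
    }
}

# The hash keys are the unique hosts.
print "$_\n" for keys %hosts;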

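Command-line usage: perl script.pl htmlFile1 [htmlFile2 ...] [>outFile]

You can pass multiple HTML files to the script. The optional trailing >outFile is a shell redirection that sends the output to a file.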

Partial output using the cnn.com homepage as the HTML source:

www.huffingtonpost.com
a.visualrevenue.com
earlystart.blogs.cnn.com
reliablesources.blogs.cnn.com
insideman.blogs.cnn.com
cnnphotos.blogs.cnn.com
cnnpresents.blogs.cnn.com
i.cdn.turner.com
www.stylelist.com
js.revsci.net
z.cdn.turner.com
www.cnn.com
...

Hope this helps!
