html - How can I extract the HTML title with Perl?

Question

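Is there a way to extract an HTML page's title using Perl? I know it can be passed as a hidden variable during a form submit and then retrieved in Perl that way, but I was wondering whether there is a way to do this without the submit.

For example, say I have an HTML page like this: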

<html><head><title>TEST</title></head></html>

and then in Perl I want to do:

$q -> h1('something');

How can I dynamically replace 'something' with whatever is contained in the <title> tags?

score 8 · Accepted Answer

I would use pQuery. It works just like jQuery.

You can say:

use strict;
use warnings;
use feature 'say';   # say() is not enabled by default
use pQuery;

my $page = pQuery("http://google.com/");
my $title = $page->find('title');
say "The title is: ", $title->html;

Replacing stuff is similar:

$title->html('New Title');
say "The entirety of google.com with my new title is: ", $page->html;

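You can also pass a string of HTML to the pQuery constructor, which sounds like what you want to do.

Finally, if you want to use arbitrary HTML as a "template" and then "refine" it with Perl commands, you will want to use Template::Refine.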

score 3 · Accepted Answer

HTML::HeadParser will do this for you.
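
As a minimal sketch of that suggestion, assuming HTML::HeadParser is installed and the page is already held in a string: the helper name head_title is hypothetical and only there for illustration, while new, parse and header('Title') are the calls shown in the module's documentation.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: returns the <title> text found in the document head.
sub head_title {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);               # feed the document to the parser
    return $parser->header('Title');     # the title is exposed as a pseudo-header
}

# The sample page from the question.
my $html = '<html><head><title>TEST</title></head></html>';
print head_title($html), "\n";
```

The returned string could then be passed on to something like $q->h1(...) from the question.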

Answered 2009-02-23T18:41:21.303

score 1 · Accepted Answer

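It's not clear to me what you are asking. You seem to be talking about something that could run in the user's browser, or at least something that already has an HTML page loaded.

If that's not the case, the answer is URI::Title.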

score 1 · Accepted Answer

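use strict;
use warnings;
use LWP::Simple;

# Take the URL from the command line, falling back to a default.
my $url  = @ARGV ? $ARGV[0] : 'http://www.google.com';
my $html = get($url);
defined $html or die "Could not fetch $url\n";

# /i ignores case, /s lets . match newlines; .*? stops at the first </TITLE>.
if ($html =~ m{<TITLE>(.*?)</TITLE>}is) {
    print "$1\n";
}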

score 1 · Accepted Answer

The previous answer is wrong if the HTML title tag appears more than once. That is easily overcome by checking that the title tag is valid (no tags in between).

my ($title) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;

score 0 · Accepted Answer

Get the title name from a file.

# Assumes $absPath, $startstring ('<title>'), $endstring ('</title>')
# and the output filehandle WFL are set up earlier in the script.
my $spool = 0;

open my $fh, "<", $absPath or die $!;
# write the opening bracket
print WFL "[";
while (<$fh>) {
    # remove the newline from the line just read
    chomp;
    # remove leading and trailing whitespace
    s/^\s+|\s+$//g;

    # case where <title> and </title> occur on the same line:
    # print the title and stop reading
    if (/$startstring/i && /$endstring/i) {
        my ($title) = m/$startstring(.*?)$endstring/si;
        print WFL "'$title',";
        last;
    }
    # case where <title> is on one line and </title> is on a later line:
    # the opening <title> is found on this line
    elsif (/$startstring/i) {
        # extract everything after <title> and nothing before it
        # (.* rather than .+ so a tag alone on its line still matches)
        my ($title) = m/$startstring(.*)/si;
        print WFL "'$title";
        $spool = 1;
    }
    # the closing </title> is found
    elsif (/$endstring/i) {
        # read everything before </title> and nothing after it
        my ($title) = m/(.*)$endstring/si;
        print WFL " $title',";
        $spool = 0;
        last;
    }
    # copy through every line between <title> and </title>
    elsif ($spool == 1) {
        print WFL " $_";
    }
}
close $fh;
# end of getting the title name

score -2 · Accepted Answer

If you are just looking to extract the page title, you could use a regular expression. I believe it would be something like this:

my ($title) = $html =~ m/<title>(.+)<\/title>/si;

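where your HTML page is stored in the string $html. In si, the s stands for single-line mode (i.e., the dot also matches a newline) and the i for ignore case.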
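
To make the effect of the s and i modifiers concrete, here is a small self-contained sketch using only core Perl. The helper name extract_title is hypothetical, and the pattern differs from the one above in two assumed refinements: a non-greedy .*? so that matching stops at the first closing tag, and [^>]* to tolerate attributes on the opening tag. A regular expression remains a rough tool for HTML, so treat this as an illustration and not a robust parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the first <title> element, or undef.
sub extract_title {
    my ($html) = @_;
    # /s: the dot also matches newlines; /i: ignore case;
    # .*? is non-greedy, so the match ends at the first </title>.
    my ($title) = $html =~ m{<title\b[^>]*>(.*?)</title>}si;
    return unless defined $title;
    $title =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
    return $title;
}

# Upper-case tags and a title spread over several lines.
my $html = "<html><head><TITLE>\n  TEST\n</TITLE></head></html>";
print extract_title($html), "\n";
```

Without the s modifier the multi-line title would not match, and without i the upper-case TITLE tags would be missed.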
