html - How do I extract data from a quoted-printable encoded HTML table?

Question

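I know there are many other posts about the HTML::TableExtract module, but all of them are well above my current level of understanding. I have a very small table from an email (3 rows, 5 columns), and I want to grab all the data in the second row. However, because my knowledge of Perl is limited, I have had a lot of trouble following the online documentation.

The table looks like this: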

Time      notspam    probablespam    likelyspam    spam
2012-05   10252205   62192           55995         3797710
Total     ""         ""              ""            ""

Here is the snippet I want to parse. It is the second of the three rows:

<tr class=3DmailViewUnreadOdd>

<td  class=3DreportViewHeader align=3D"left">
=09
     2012-05
</td>
=20=20
=20=20=20=20
     <td align=3D'right' class=3D'mailViewRowReadEven'>
10252205
=20=20=20=20
</td>
=20=20
=20=20=20=20
     <td align=3D'right' class=3D'mailViewRowReadEven'>
62192
=20=20=20=20
</td>
=20=20
=20=20=20=20
     <td align=3D'right' class=3D'mailViewRowReadEven'>
55995
=20=20=20=20
</td>
=20=20
=20=20=20=20
     <td align=3D'right' class=3D'mailViewRowReadEven'>
3797710
=20=20=20=20
</td>
=20=20
</tr>

Here is what I have tried so far. I took an example from the HTML::TableExtract page and modified it to suit my needs, but it returns nothing:

use HTML::TableExtract;
my $te = HTML::TableExtract->new(
    headers => [qw(notspam  probablespam  likelyspam  spam)]);
my $html = 'test.html';
$te->parse($html);
# Examine all matching tables
foreach $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

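I want to extract the date (2012-05) and the numbers (10252205, 62192, 55995, 3797710) and store them in variables. Should I use the depth and count parameters to extract the data?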

score 0 · Accepted Answer

This works for your sample data. (When run against the complete email it may capture too much, but I only have the partial HTML to work with.)

use strictures;
use File::Slurp qw(read_file);
use MIME::QuotedPrint qw(decode_qp);
use Web::Query qw();

my $w = Web::Query->new_from_html(decode_qp read_file 'so10883053.html');
my @data = $w->find('.mailViewUnreadOdd > *')->text;
# (
#     " 2012-05 ",
#       10252205 ,
#          62192 ,
#          55995 ,
#        3797710
# )
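The values come back with surrounding whitespace, and the question asks for them in variables. A minimal sketch of that last step, using only core Perl; the sample list below is hypothetical and merely shaped like the output above, and the variable names are illustrative:

```perl
use strict;
use warnings;

# Hypothetical sample values, padded with whitespace like the real output.
my @data = (" 2012-05 ", "\n10252205\n    ", "\n62192\n    ",
            "\n55995\n    ", "\n3797710\n    ");

# Strip leading and trailing whitespace from a copy of each value.
my @clean = map { my $v = $_; $v =~ s/^\s+|\s+$//g; $v } @data;

# The first cell is the date, the remaining four are the counts.
my ($date, $notspam, $probablespam, $likelyspam, $spam) = @clean;

print "$date: notspam=$notspam probablespam=$probablespam ",
      "likelyspam=$likelyspam spam=$spam\n";
```

Copying each value before the substitution leaves the original list untouched, which can help if the raw text is still needed for debugging.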

Rather than decoding the email by hand as I show in the code, you should use a very high-level parser such as Courriel.
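For anyone who would rather stay with HTML::TableExtract, here is a rough sketch of the same idea: decode the quoted-printable text first, then hand the decoded HTML to the extractor. Note that `parse` expects HTML text, so passing it the string 'test.html' parses that literal filename rather than the file's contents. The inline table is a hypothetical reconstruction based on the question, and `depth => 0, count => 0` assumes the wanted table is the first top-level table, which may not hold for the complete email.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::TableExtract;

# Hypothetical quoted-printable table, shaped like the one in the question.
my $encoded = <<'END_QP';
<table>
<tr><th>Time</th><th>notspam</th><th>probablespam</th><th>likelyspam</th><th>spam</th></tr>
<tr class=3DmailViewUnreadOdd>
<td class=3DreportViewHeader align=3D"left">
=09
     2012-05
</td>
<td align=3D'right' class=3D'mailViewRowReadEven'>
10252205
=20=20=20=20
</td>
<td align=3D'right' class=3D'mailViewRowReadEven'>
62192
=20=20=20=20
</td>
<td align=3D'right' class=3D'mailViewRowReadEven'>
55995
=20=20=20=20
</td>
<td align=3D'right' class=3D'mailViewRowReadEven'>
3797710
=20=20=20=20
</td>
</tr>
</table>
END_QP

# Turn =3D back into "=", =20 into a space, =09 into a tab.
my $html = decode_qp($encoded);

# Assumption: the wanted table is the first one at nesting depth 0.
my $te = HTML::TableExtract->new(depth => 0, count => 0);
$te->parse($html);

my ($table) = $te->tables;
die "no table found\n" unless $table;

# Row 0 is the header row here; row 1 holds the data.
my @rows  = $table->rows;
my @cells = map { my $v = $_ // ''; $v =~ s/^\s+|\s+$//g; $v } @{ $rows[1] };

my ($date, @counts) = @cells;
print "$date: @counts\n";
```

If the real email nests the report inside layout tables, the depth and count values would need adjusting; printing `$ts->coords` for each table, as in the question's loop, is one way to find them.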
