perl - Perl: How to extract and combine two sub-sections of a table using HTML::TableExtract?

Question

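I am trying to scrape and extract data from multi-page tables that show the value and quantity (volume) of various commodities imported into China each month. Ultimately, I want to write this data to a text file for further processing. Here is a screenshot of one such page.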

[Screenshot: Chinese import data]

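Specifically, I want to extract four fields: the name of the imported commodity, the unit of volume (e.g. tonnes, kilograms), the actual value, and the volume. The problem I am running into is that the tables I want to extract seem to be at different depths.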

I can extract the 'volume' and 'value' fields because they are at the same depth, so I get debug output like this:

2,314 --- 2,943
20,082 --- 80,176
7 --- 62,426

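The 'name' and 'units' fields are (I think) at a different level from the 'volume' and 'value' fields, so they are not picked up when I use headers for all four fields. However, if I extract them as a separate sub-table, it works fine and gives the following debug output: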

啤酒 --- 千升
葡萄酒 --- 千升
饲料用鱼粉 --- 万吨

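How should I approach this? My first thought was to extract each table separately, iterate over the rows of each, and add two fields from one table and two fields from the other to an array with four elements per row. (In R, I think I would create a data frame and use cbind for this.) That seems feasible, but it does not feel optimal. So first I would like to ask: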

1) Is there a direct way to tell HTML::TableExtract to extract two subsets of a table and combine them?

2) If I have to extract the data as two separate tables and combine them, what is the most efficient way to do that?

My code so far is below:

use strict;
use HTML::TableExtract;
use Encode;
use utf8;
use WWW::Mechanize;
use Data::Dumper;

binmode STDOUT, ":utf8";

# Chinese equivalents of the various headings
my $txt_header = "单位：千美元";
my $txt_name = "商品名称";
my $txt_units = "计量单位";
my $txt_volume = "数量";
my $txt_value = "金额";

# Chinese Customs site
my $url = "http://www.chinacustomsstat.com/aspx/1/newdata/record_class.aspx?page=2&guid=951";

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)');
my $page = $mech->get( $url );
my $htmlstuff = $mech->content();

print ("\nFirst table with two headers (volume and value) at same depth\n\n");
my $te = HTML::TableExtract->new( depth => 1, headers => [ $txt_volume, $txt_value ] );
$te->parse($htmlstuff);

# See what we have
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

print ("\nSecond table with 'name' and 'units'\n");

$te = HTML::TableExtract->new( headers => [ $txt_name, $txt_units ] );
$te->parse($htmlstuff);

# See what we have in the other table
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

score 2 · Accepted Answer

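It looks like the table is indeed implemented as nested tables. In that case, it seems to me that your application will need to compensate by parsing the tables separately and splicing the results together.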

I took some liberties with your example code and came up with this:

#!/usr/bin/perl

use utf8;
use strict;
use warnings;
{
    use WWW::Mechanize;
    use HTML::TableExtract;
    use Data::Dumper;
    use Text::FormatTable;
}

binmode STDOUT, ':utf8';

my $txt_name   = '商品名称';
my $txt_units  = '计量单位';
my $txt_volume = '数量';
my $txt_value  = '金额';

my $url
    = 'http://www.chinacustomsstat.com'
    . '/aspx/1/newdata/record_class.aspx'
    . '?page=2&guid=951';

my $mech = WWW::Mechanize->new(
    agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)' );

my $page = $mech->get($url);
my $html = $mech->content();

my %data_for;
{
    my %config_for = (
        products => {
            values  => [],
            headers => [ $txt_name, $txt_units ],
        },
        data => {
            values  => [],
            headers => [ $txt_volume, $txt_value ],
        },
    );

    for my $type ( keys %config_for ) {

        my $config_rh = $config_for{$type};

        my $te = HTML::TableExtract->new( headers => $config_rh->{headers} );

        $te->parse($html);

        for my $ts ( $te->tables() ) {

            for my $row_ra ( $ts->rows() ) {

                if ( defined $row_ra->[0] ) {

                    push @{ $config_rh->{values} }, $row_ra;
                }
            }
        }
    }

    if ( @{ $config_for{products}->{values} }
        != @{ $config_for{data}->{values} } )
    {
        warn 'not as many value rows were parsed as product rows';
    }

    for my $i ( 0 .. $#{ $config_for{products}->{values} } ) {

        my $product_ra = $config_for{products}->{values}->[$i];
        my $data_ra    = $config_for{data}->{values}->[$i];

        my ( $product, $units ) = @{$product_ra};
        my ( $volume,  $value ) = @{$data_ra};

        $data_for{$product} = {
            units  => $units,
            volume => $volume,
            value  => $value,
        };
    }
}

# process results in %data_for hash
{
    my $table = Text::FormatTable->new('| l | l | l | l |');

    $table->head( $txt_name, $txt_units, $txt_volume, $txt_value, );
    $table->rule('=');

    for my $product ( keys %data_for ) {

        $table->row(
            $product,
            @{ $data_for{$product} }{qw( units volume value )}
        );
        $table->rule('-');
    }

    print $table->render();
}

__END__

I am a little disappointed in how Text::FormatTable handles (or fails to handle) wide characters. But I don't think that is relevant to this example.
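Since the stated goal is a text file for further processing, the splicing step can also be sketched on its own, keeping the rows in page order instead of storing them in a hash (a hash loses the original order and overwrites rows that share a product name). This is only a minimal sketch: combine_rows is a hypothetical helper, not part of HTML::TableExtract, and the sample rows are copied from the debug output in the question on the assumption that the two lists line up by position.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of HTML::TableExtract): pair up the rows of
# the two extractions by position and return four-field rows in page order.
sub combine_rows {
    my ( $products_ra, $data_ra ) = @_;

    # Assumption: both lists hold the same number of rows, in the same order.
    die "row counts differ: " . @$products_ra . " vs " . @$data_ra . "\n"
        if @$products_ra != @$data_ra;

    # Each result row is [ name, units, volume, value ].
    return map { [ @{ $products_ra->[$_] }, @{ $data_ra->[$_] } ] }
        0 .. $#{$products_ra};
}

# Sample rows copied from the debug output in the question; that they
# correspond to each other by position is an assumption.
my @products = ( [ '啤酒', '千升' ], [ '葡萄酒', '千升' ], [ '饲料用鱼粉', '万吨' ] );
my @data     = ( [ '2,314', '2,943' ], [ '20,082', '80,176' ], [ '7', '62,426' ] );

my @combined = combine_rows( \@products, \@data );

# Emit one tab-separated line per product, encoded as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print join( "\t", @$_ ), "\n" for @combined;
```

To write a file instead of STDOUT, open a handle with a '>:encoding(UTF-8)' mode and print to that. In the real script the two lists would come from the two HTML::TableExtract passes shown above rather than from literals.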
