


I can extract the 'volume' and 'value' fields because they are at the same depth, which gives me the following debug output:

2,314 --- 2,943
20,082 --- 80,176
7 --- 62,426

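The 'name' and 'units' fields are (I think) at a different level from the 'volume' and 'value' fields, so they are not picked up when I pass headers for all 4 fields. However, if I extract them as a separate table, it works fine and gives the following debug output: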

啤酒 --- 千升
葡萄酒 --- 千升
饲料用鱼粉 --- 万吨

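How should I solve this? My first thought was to extract each table separately, loop over the rows of each one, and add the 2 fields from one table and the 2 fields from the other to an array with 4 elements per row. (In R, I think I would create a data frame and use cbind for this.) That seems feasible, but it doesn't feel optimal. So first I'd like to ask: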




use strict;
use HTML::TableExtract;
use Encode;
use utf8;
use WWW::Mechanize;
use Data::Dumper;

binmode STDOUT, ":utf8";

# Chinese equivalents of the various headings
my $txt_header = "单位:千美元";
my $txt_name = "商品名称";
my $txt_units = "计量单位";
my $txt_volume = "数量";
my $txt_value = "金额";

# Chinese Customs site
my $url = "http://www.chinacustomsstat.com/aspx/1/newdata/record_class.aspx?page=2&guid=951";

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)');
my $page = $mech->get( $url );
my $htmlstuff = $mech->content();

print ("\nFirst table with two headers (volume and value) at same depth\n\n");
my $te = HTML::TableExtract->new( depth => 1,  headers => [ ( $txt_volume, $txt_value ) ]);
$te->parse($htmlstuff);

# See what we have
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

print ("\nSecond table with 'name' and 'units'\n");

$te = HTML::TableExtract->new( headers => [ ( $txt_name, $txt_units ) ]);
$te->parse($htmlstuff);

# See what we have in the other table
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

1 Answer





use utf8;
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TableExtract;
use Data::Dumper;
use Text::FormatTable;

binmode STDOUT, ':utf8';

my $txt_name   = '商品名称';
my $txt_units  = '计量单位';
my $txt_volume = '数量';
my $txt_value  = '金额';

my $url
    = 'http://www.chinacustomsstat.com'
    . '/aspx/1/newdata/record_class.aspx'
    . '?page=2&guid=951';

my $mech = WWW::Mechanize->new(
    agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)' );

my $page = $mech->get($url);
my $html = $mech->content();

my %data_for;
my %config_for = (
    products => {
        values  => [],
        headers => [ $txt_name, $txt_units ],
    },
    data => {
        values  => [],
        headers => [ $txt_volume, $txt_value ],
    },
);

# Run one extraction pass per header pair and collect the matching rows
for my $type ( keys %config_for ) {

    my $config_rh = $config_for{$type};

    my $te = HTML::TableExtract->new( headers => $config_rh->{headers} );

    $te->parse($html);

    for my $ts ( $te->tables() ) {

        for my $row_ra ( $ts->rows() ) {

            # Skip rows whose first cell is empty
            if ( defined $row_ra->[0] ) {

                push @{ $config_rh->{values} }, $row_ra;
            }
        }
    }
}

if ( @{ $config_for{products}->{values} }
    != @{ $config_for{data}->{values} } )
{
    warn 'not as many value rows were parsed as product rows';
}

# Pair row $i of the product table with row $i of the data table
for my $i ( 0 .. $#{ $config_for{products}->{values} } ) {

    my $product_ra = $config_for{products}->{values}->[$i];
    my $data_ra    = $config_for{data}->{values}->[$i];

    my ( $product, $units ) = @{$product_ra};
    my ( $volume,  $value ) = @{$data_ra};

    $data_for{$product} = {
        units  => $units,
        volume => $volume,
        value  => $value,
    };
}

# process results in %data_for hash
my $table = Text::FormatTable->new('| l | l | l | l |');

$table->head( $txt_name, $txt_units, $txt_volume, $txt_value, );

for my $product ( keys %data_for ) {

    $table->row( $product,
        @{ $data_for{$product} }{qw( units volume value )} );
}

print $table->render();


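I'm a little disappointed with the way Text::FormatTable handles (or doesn't handle) wide characters. But I don't think that is relevant to this example.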
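
If it helps to try the row-by-row pairing without hitting the site, here is a minimal offline sketch using only core Perl. The sample rows are copied from the debug output in the question, and I am assuming both outputs are in the same row order. merge_rows is a hypothetical helper name, not part of HTML::TableExtract.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Sample rows taken from the question's debug output. In the real script
# these would come from the two separate HTML::TableExtract passes.
my @product_rows = (
    [ '啤酒',       '千升' ],
    [ '葡萄酒',     '千升' ],
    [ '饲料用鱼粉', '万吨' ],
);
my @data_rows = (
    [ '2,314',  '2,943' ],
    [ '20,082', '80,176' ],
    [ '7',      '62,426' ],
);

# Hypothetical helper: glue row i of the left list to row i of the right
# list, roughly what cbind does in R. Dies if the row counts differ.
sub merge_rows {
    my ( $left_ra, $right_ra ) = @_;

    die "row counts differ\n" if @{$left_ra} != @{$right_ra};

    return map { [ @{ $left_ra->[$_] }, @{ $right_ra->[$_] } ] }
        0 .. $#{$left_ra};
}

my @merged = merge_rows( \@product_rows, \@data_rows );

# One line per merged row, four fields each
print join( ' --- ', @{$_} ), "\n" for @merged;
```

Keeping the merged rows in an array preserves the order they appeared in on the page and tolerates repeated product names, whereas iterating over the keys of a hash returns them in no particular order and keeps only one entry per name.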

Answered 2012-11-18T04:17:53.440