
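I am trying to run a Perl script (in a Windows cmd window), but it always stops working at a certain point. How can I find out why it does not continue?

Here is the script. The last thing I can see being executed is "get_html_source()" on line 37.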

#!/usr/bin/perl
# Perl script that scrapes the members of the Hellenic Parliament
# Created by Kostas Ntonas, 03 May 2013 - http://ntonas.gr
# http://deixto.blogspot.gr/2013/05/scraping-members-of-greek-parliament.html

use strict;
use warnings;
use utf8;

use IO::File;
use POSIX qw(tmpnam);
use DEiXToBot;
use WWW::Selenium;

my $agent = DEiXToBot->new(); # create the DEiXToBot agent object

# launch a Firefox instance
my $sel = WWW::Selenium->new( host => "localhost",
                              port => 4444,
                              browser => "*firefox",
                              browser_url => "http://www.hellenicparliament.gr/"
                            );
$sel->start;

for my $i (1..30) {

    my $url = "http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia?pageNo=$i";

    $sel->open($url);

    $sel->wait_for_page_to_load(5000);

    $sel->pause(1);

    print "$i) $url\n";

    my $content = $sel->get_html_source();

    my ($fh,$name); # create a temporary file containing the page's source code
    do { $name = tmpnam() } until $fh = IO::File->new($name, O_RDWR|O_CREAT|O_EXCL);
    binmode( $fh, ':utf8' );
    print $fh $content;
    close $fh;

    $agent->get("file://$name"); # load the temporary file/page with the DEiXToBot agent using the file:// scheme

    unlink $name; # delete the temporary file, it is not needed any more

    if (! $agent->success) { die "Could not fetch the temp file!\n"; }

    $agent->build_dom();

    $agent->load_pattern('C:\Users\XXX\Documents\Privat\MyCase3\Deixto Patterns\parliament_CVs.xml');

    $agent->extract_content();

    if (! $agent->hits) {
        die "Could not find any MPs/ records!\n";
    }
    else {
        for my $record ($agent->records) {
            my @rec = @$record;

            my $party;
            my $logo = $rec[0];

            # deduce the party name from the logo in the first column of the table
            if ($logo=~m#ND_Logo#) { $party = "N.D. (New Democracy)"; }
            elsif ($logo=~m#COALITION#) { $party = "SYRIZA Unitary Social Front"; }
            elsif ($logo=~m#PASOK#) { $party = "PA.SO.K. (Panhellenic Socialist Movement)"; }
            elsif ($logo=~m#ANEKS_ELL#) { $party = "ANEXARTITOI ELLINES (Independent Hellenes)"; }
            elsif ($logo=~m#xrisi#) { $party = "LAIKOS SYNDESMOS - CHRYSI AVGI (People's Association - Golden Dawn)"; }
            elsif ($logo=~m#small#) { $party = "DHM.AR (Democratic Left)"; }
            elsif ($logo=~m#KKE#) { $party = "K.K.E. (Communist Party of Greece)"; }
            elsif ($logo=~m#INDEPENDENT#) { $party = "INDEPENDENT"; }
            else { die "$logo => Unknown logo!\n"; }

            $rec[0] = $party;

            $rec[3]=~s#\s+# #g; # replace whitespace characters with a single space

            # append the data in a tab delimited text file
            open my $fh,">>:utf8","MPs.txt";
            print $fh join("\t",@rec)."\n";
            close $fh;
        }
    }
}

$sel->stop;

2 Answers


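The tmpnam function is provided by the POSIX Perl module. It should work fine on most Unix/Linux variants, but it appears to be broken on Windows. I suggest replacing the "problematic" line that contains the tmpnam call with the following: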

use File::Temp qw/ tempfile /;
($fh,$name) = tempfile();

Hopefully this change resolves the problem and lets the script run to completion.

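This is also what the Perl documentation for tmpnam ( http://perldoc.perl.org/POSIX.html ) recommends: "For security reasons, which are probably detailed in your system's documentation for the C library tmpnam() function, this interface should not be used; instead see File::Temp."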
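As a minimal, self-contained sketch of the replacement (not the original author's code): the $content string below is a placeholder for whatever get_html_source() returns, and the SUFFIX and UNLINK options are my own additions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Placeholder standing in for the page source returned by $sel->get_html_source().
my $content = "<html><body>placeholder page</body></html>\n";

# tempfile() creates and opens the file in one step, so the
# do { ... } until retry loop around tmpnam() is no longer needed.
# UNLINK => 1 asks File::Temp to delete the file when the program exits.
my ($fh, $name) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');
print {$fh} $content;
close $fh or die "Could not close $name: $!\n";

print "Wrote ", -s $name, " bytes to $name\n";
```

With UNLINK => 1 the explicit unlink in the script becomes optional, although keeping it does no harm.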

answered 2013-11-02T17:05:08.783

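Do you know for certain that the code is dying inside get_html_source, or is it actually dying immediately before or after it (for example, in the call to tmpnam, which appears to be missing a semicolon)?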
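One way to narrow this down is to log the start and end of each step and to report any exception. The sketch below is only an illustration: the traced() helper is hypothetical, and the two steps are stand-ins for the real calls such as $sel->get_html_source() or tmpnam().

```perl
use strict;
use warnings;
use IO::Handle;

# Flush output immediately so the last message really is the last step reached.
STDOUT->autoflush(1);
STDERR->autoflush(1);

# Run one named step, log when it starts and finishes, and log any
# exception before re-throwing it (hypothetical helper, not part of any module).
sub traced {
    my ($label, $code) = @_;
    print STDERR "[start] $label\n";
    my $result = eval { $code->() };
    if (my $err = $@) {
        print STDERR "[died ] $label: $err";
        die $err;
    }
    print STDERR "[done ] $label\n";
    return $result;
}

# Stand-in steps; in the script these would wrap the Selenium and tmpnam calls.
my $content = traced('get_html_source', sub { return "<html></html>" });
my $length  = traced('measure content', sub { return length $content });
print "content length: $length\n";
```

If the script hangs rather than dies, the log ends with a "[start]" line that has no matching "[done]" line, which identifies the step that never returned.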

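Another remark: this seems like a lot of work just to scrape a list of members of parliament and their parties. If you look at the page source, there is a large block of base-64 encoded text that appears to contain all the data you need. You may therefore find it quicker to load the page, decode that block, and have everything you need.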
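A rough sketch of that idea, under loud assumptions: I have not inspected the actual page, so the HTML, the field name, and the payload below are all invented for illustration, and decode_longest_base64_run() is a hypothetical helper. On the real page the decoded bytes may well need further parsing before they are usable.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Return the decoded bytes of the longest base64-looking run in $html,
# or undef if no run of at least $min_len characters is found.
sub decode_longest_base64_run {
    my ($html, $min_len) = @_;
    my $best = '';
    while ($html =~ /([A-Za-z0-9+\/]{$min_len,}={0,2})/g) {
        $best = $1 if length($1) > length($best);
    }
    return length($best) ? decode_base64($best) : undef;
}

# Invented sample page: the field name and payload are assumptions, not taken from the site.
my $payload = "name=Example Member;party=EXAMPLE";
my $html    = '<input type="hidden" name="hypothetical_field" value="'
            . encode_base64($payload, '')
            . '" />';

my $decoded = decode_longest_base64_run($html, 16);
print defined $decoded ? "$decoded\n" : "no base64 block found\n";
```

Picking the longest run is only a heuristic; if the page uses a known field name, matching on that name would be more reliable.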

answered 2013-10-23T21:14:19.677