
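For example:

I load input from a .txt file:

Benjamin,Schuvlein,Germany,1912,M,White

I run it through some code, which I won't post here for brevity, and arrive at the link:

https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ

  1. I want to scrape multiple items from that page. In the code below, I only grab one.
  2. I also want each item separated by a , in the output .txt.
  3. And I want the output to be preceded by the input.

I use the following packages in my code: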

use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;

Here is the relevant code:

my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error " . $response->code if !$response->is_success;
my $dom_tree = HTML::DOM->new;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;

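The desired output (from that link) would be something like:

Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head,etc......

(How much of that output can actually be scraped?)

Any help on how to get the links within the link would be greatly appreciated!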


3 Answers


Try this:

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTML::TableExtract;
use Data::Dumper;

# Note: this turns off TLS hostname verification; drop it if the certificate validates
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
my $req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
my $res = $ua->request($req);
die "Couldn't get it! " . $res->status_line unless $res->is_success;
my $content = $res->decoded_content;
#my $content = get("https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ") or die "Couldn't get it! $!";
# Only look at the table whose class attribute is "result-data"
my $te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
# my $te = HTML::TableExtract->new( );
$te->parse($content);
my $table = $te->first_table_found;
# print $content; exit;
# $te->tables_dump(1);
#print Dumper($te);
#print Dumper($table);
# Print the label and the value held in the fifth row of the table
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n";

which prints

event place: = Assembly District 2, Queens, New York City, Queens, New York, United States
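To collect every label/value pair rather than a single cell, the rows of the extracted table can be walked in a loop. The following is only a sketch: the inline HTML is a hypothetical stand-in for the fetched page (the real markup may differ), and the `result-label` / `result-value` class names are assumed.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the fetched page; the real markup may differ.
my $content = <<'HTML';
<table class="result-data">
<tr><td class="result-label">name:</td><td class="result-value">Benjamin Schuvlein</td></tr>
<tr><td class="result-label">birthplace:</td><td class="result-value">Germany</td></tr>
<tr><td class="result-label">estimated birth year:</td><td class="result-value">1912</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new( attribs => { class => 'result-data' } );
$te->parse($content);
my $table = $te->first_table_found or die "No result-data table found";

# Build a hash keyed by the label in the first column of each row.
my %data;
for my $row ( $table->rows ) {
    my ( $label, $value ) = @$row;
    next unless defined $label && defined $value;
    s/^\s+|\s+$//g for $label, $value;    # trim surrounding whitespace
    $label =~ s/:$//;                     # drop the trailing colon
    $data{$label} = $value;
}

print join( ',', @data{ 'name', 'birthplace', 'estimated birth year' } ), "\n";
```

Keying the hash by label means the output order is chosen by the hash slice in the last line, not by the order of rows on the page.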

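I also noticed this header:

X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.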
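A header like this can be read from the response object with its `header` method. The sketch below builds an `HTTP::Response` by hand so that it runs offline; that hand-built response and its shortened header value are assumptions for illustration, and in the script above the object would be `$res`.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response built by hand so the sketch runs without network access.
# The header value is shortened to the first sentence of the notice.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'X-Copyright' => 'COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright.' ],
);

# header() returns undef when the header is absent.
my $notice = $res->header('X-Copyright');
if ( defined $notice ) {
    print "Site sent an X-Copyright header: $notice\n";
}
```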

See also http://metacpan.org/pod/HTML::Element#SYNOPSIS

answered 2013-02-12T02:37:16.617

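Accessing the HTML is simple with HTML::TreeBuilder::XPath. This program builds a hash of the data using the labels as keys, so any of the required information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.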
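As a minimal sketch of that quoting rule in isolation, the snippet below also prepends the original input line, which the question asks for. The helper names `quote_field` and `build_line` are hypothetical, and the sample values are taken from the question's desired output.

```perl
use strict;
use warnings;

# Quote a field only when it contains a comma or whitespace.
sub quote_field {
    my ($field) = @_;
    $field = '' unless defined $field;
    return $field =~ /[,\s]/ ? qq<"$field"> : $field;
}

# Hypothetical helper: join the original input line and the scraped fields.
sub build_line {
    my ( $input_line, @fields ) = @_;
    chomp $input_line;
    return join ',', $input_line, map { quote_field($_) } @fields;
}

my $input   = "Benjamin,Schuvlein,Germany,1912,M,White\n";
my @scraped = ( 'Queens', 'New York', 'Married', 'Same Place', 'Head' );

my $line = build_line( $input, @scraped );
print "$line\n";
```

This does not escape double quotes embedded in a field; if the data may contain them, a CSV module such as Text::CSV is probably the safer choice.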

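I don't know whether you have permission to extract data from this site in this way, but I should draw your attention to the X-Copyright header in the HTTP response. This approach clearly falls under the heading of programmatic access.

X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.

Should I be expecting an email from you? I replied to your first email but haven't heard anything since.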

use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');
my %data;
for my $item (@results) {
  my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
  $key =~ s/:$//;
  $data{$key} = $val;
}

my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
  'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
  'event place', 'marital status', 'residence in 1935',
  'relationship to head of household (standardized)';

print $record, "\n";

Output

"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
answered 2013-02-13T16:24:44.480

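I thought I had already answered your question.

The problem is that you are trying to fetch the web page with LWP. Why would you do that if you already have WWW::Mechanize::Firefox?

Have you tried this?

It retrieves and saves each link for further analysis. With one small change, you "get" the DOM tree. Sorry, I don't have access to this page, so I can only hope it works.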

my $i = 1;
for my $link (@links) {
  print Dumper $link->url;
  print Dumper $link->text;
  # Double quotes so that $i is interpolated into the file name
  my $tempfile = "./$i.html"; $i++;
  $mech->get( $link, ':content_file' => $tempfile, synchronize => 1 );
  my $dom_tree = $mech->document();
  my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}

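Edit: processing the page content with regular expressions (everyone: please remember, there is always more than one way to do something in Perl! It works, it's easy...)

I tried it with this command:

wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html && cat 1.html | perl 1.pl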

use Data::Dumper;
use strict;
use warnings;

local $/=undef;
my $html = <>;#read from file
#$html = $mech->content( format => 'html' );# read data from mech object
my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is){ # process each TD
    my $td = $1;
    print "td: $td\n";
    my $td_val = $td;
    $td_val =~ s!<[^>]*>!!gis;
    $td_val =~ s!\s+! !gs;
    $td_val =~ s!(\A\s+|\s+\z)!!gs;
    if      ($td =~ m!result-label!){ #primitive state machine, store the current label
        print "current_label: $current_label\n";
        $current_label = $td_val;
    } elsif ($td =~ m!result-value!){ #add each data to current label
        push(@{$data->{$current_label}},$td_val);

    } else {
        warn "found something else: $td\n";
    }
}
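# Process the result using whitelists of known entries (son, race, etc.).
# Delete each entry from the result once it is handled; die if something new turns up.
# multi-valued types
foreach my $type (qw(son wife daughter head)){
    process_multi($type,$data->{$type});
    delete($data->{$type});
}
# single-valued types
foreach my $type (qw(birthplace age)){
    process_simple($type,$data->{$type});
    delete($data->{$type});
}

die "Unknown label!".Dumper($data) if scalar(keys %{$data})>0;

# Hypothetical placeholders for process_multi/process_simple, which are left to the reader;
# these versions just print the values.
sub process_multi {
    my ($type, $values) = @_;
    print "$type: ", join(',', @{ $values || [] }), "\n";
}
sub process_simple {
    my ($type, $values) = @_;
    print "$type: ", ( $values ? $values->[0] : '' ), "\n";
}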

Output:

      'line number:' => [
                          '28'
                        ],
      'estimated birth year:' => [
                                   '1912'
                                 ],
      'head' => [
                  'Benjamin Schuvlein',
                  'M',
                  '28',
                  'Germany'
                ],
answered 2013-02-11T15:09:50.283