regex - Using Perl LWP::Simple to Process an Online Price Lookup Site

Question

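In my spare time I've been trying to improve my Perl by writing a script that uses LWP::Simple to poll a specific website's product pages and check the products' prices (I'm a bit of a Perl noob). The script also keeps a very simple backlog of the last price seen for each item, since the prices change frequently.

I was wondering whether there is a way to automate the script further so that I don't have to explicitly add each page's URL to the initial hash (i.e. keep an array of key terms and run a search query on Amazon to find the page or the price?). Is there any way I could do this that doesn't involve just copying Amazon's search URL and parsing in my keywords? (I'm aware that processing HTML with a regex is generally bad form; I'm only using it because I need one small piece of data.)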


#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my %oldPrice;
my %nameURL = (
    "Archer Season 1" => "http://www.amazon.com/Archer-Season-H-Jon-Benjamin/dp/B00475B0G2/ref=sr_1_1?ie=UTF8&qid=1297282236&sr=8-1",
    "Code Complete" => "http://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ref=sr_1_1?ie=UTF8&qid=1296841986&sr=8-1",
    "Intermediate Perl" => "http://www.amazon.com/Intermediate-Perl-Randal-L-Schwartz/dp/0596102062/ref=sr_1_1?s=books&ie=UTF8&qid=1297283720&sr=1-1",
    "Inglorious Basterds (2-Disc)" => "http://www.amazon.com/Inglourious-Basterds-Two-Disc-Special-Brad/dp/B002T9H2LK/ref=sr_1_3?ie=UTF8&qid=1297283816&sr=8-3"
);

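# Load the prices recorded on the previous run, if any.
if (-e "backlog.txt") {
    open(my $in, '<', "backlog.txt") or die "Cannot read backlog.txt: $!";
    while (<$in>) {
        chomp;
        # Each line has the form "name: price".
        my @temp = split(/:\s/, $_, 2);
        $oldPrice{$temp[0]} = $temp[1];
    }
    close($in);
}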

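print "\nChecking Daily Amazon Prices:\n";
open(my $out, '>', "backlog.txt") or die "Cannot write backlog.txt: $!";
foreach my $key (sort keys %nameURL) {
    my $content = get($nameURL{$key}) or die "Could not fetch page for $key";
    # Take the first dollar amount on the page as the price.
    $content =~ m{\s*\$(\d+\.\d+)} or die "No price found for $key";
    my $price = $1;
    if (exists $oldPrice{$key} && $oldPrice{$key} != $price) {
        print "$key: \$$price (Was \$$oldPrice{$key})\n";
    }
    else {
        print "$key: \$$price\n";
    }
    print $out "$key: $price\n";
}
close($out);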

score 3 · Accepted Answer

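Yes, the design can be improved. It would be best to delete everything and start over with an existing full-featured web-scraping application or framework, but since you want to learn: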

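The name-to-URL mapping is configuration data. Retrieve it from outside the program.
Store the historical data in a database.
Learn XPath and use it to extract data from the HTML; it is easy if you are already familiar with CSS selectors.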
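
As a rough illustration of the first point, the name-to-URL mapping could live in a plain text file that the script reads at startup. The sketch below assumes a tab-separated "name<TAB>URL" format and a hypothetical helper called load_products; neither comes from the original post, and the URL in the demo is a placeholder. It reads from an in-memory filehandle so that it runs on its own; a real script would open a file such as products.txt instead.

```perl
use strict;
use warnings;

# Parse "name<TAB>url" lines from an open filehandle into a hash.
# The tab-separated format is an assumption made for this sketch.
sub load_products {
    my ($fh) = @_;
    my %name_url;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;            # skip comments and blank lines
        my ($name, $url) = split /\t/, $line, 2;
        next unless defined $url && length $url;   # ignore malformed lines
        $name_url{$name} = $url;
    }
    return %name_url;
}

# Demo data with a placeholder URL; replace with the contents of a real file.
my $sample = "# name\turl\nCode Complete\thttp://example.com/code-complete\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
my %name_url = load_products($fh);
close($fh);

print "$_ => $name_url{$_}\n" for sort keys %name_url;
```

Keeping the list outside the script means products can be added or removed without touching the code, which is the point of treating the mapping as configuration.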

Other Stackers, if you want to amend my post with the rationale for each piece of advice, go ahead and edit it.

score 2 · Accepted Answer

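I made a simple script to demonstrate automating the Amazon search. The search URL for all departments was changed to take the escaped search term. The rest of the code is simple parsing with HTML::TreeBuilder. The structure of the relevant HTML can easily be inspected with the dump method (see the commented-out line).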

use strict; use warnings;

use LWP::Simple;
use URI::Escape;
use HTML::TreeBuilder;
use Try::Tiny;

my $look_for = "Archer Season 1";

my $contents
  = get("http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords="
        . uri_escape($look_for))
  or die "Could not fetch search results for '$look_for'";

my $html = HTML::TreeBuilder->new_from_content($contents);
for my $item ($html->look_down(id => qr/result_\d+/)) {
    # $item->dump;      # find out structure of HTML
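    # try yields undef when an element is missing and the chained call dies
    my $title = try { $item->look_down(class => 'productTitle')->as_trimmed_text };
    my $price = try { $item->look_down(class => 'newPrice')->find('span')->as_text };

    next unless defined $title && defined $price;   # skip incomplete results
    print "$title\n$price\n\n";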
}
$html->delete;
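
The $price text extracted above is still a display string. Before comparing it against a value stored from an earlier run, it may help to normalize it to a number. This is a minimal sketch, assuming prices are formatted like "$12.99" or "$1,299.00"; parse_price and price_changed are hypothetical helper names, not part of any module used above.

```perl
use strict;
use warnings;

# Extract a numeric price from text such as "$12.99" or "$1,299.00".
# Returns undef (in scalar context) when no dollar amount is found.
sub parse_price {
    my ($text) = @_;
    return unless defined $text;
    my ($amount) = $text =~ /\$\s*(\d[\d,]*(?:\.\d+)?)/ or return;
    $amount =~ tr/,//d;    # drop thousands separators
    return $amount + 0;    # force a numeric value
}

# True when both prices are known and differ by at least half a cent.
sub price_changed {
    my ($old, $new) = @_;
    return 0 unless defined $old && defined $new;
    return abs($old - $new) >= 0.005 ? 1 : 0;
}

my $price = parse_price('$1,299.50');
printf "%.2f\n", $price;    # prints 1299.50
```

Comparing with a small tolerance rather than != avoids spurious "changed" reports if a price ever passes through floating-point arithmetic.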
