perl - Perl issue using UserAgent to fetch a website in a loop

Question

I can grab the first image just fine, but after that the content seems to loop inside itself. I'm not sure what I'm doing wrong.

#!/usr/bin/perl
use LWP::Simple;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
for(my $id=1;$id<55;$id++)
{
    my $response = $ua->get("http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=" . $id );
    my $content = $response->content;
    for(my $id2=1;$id2<10;$id2++)
    {
        $content =~ /<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/(.*)$id2\.gif" alt=""\/>/;
        $url = "http://www.gamereplays.org/community/style_medals/" . $1 . $id2 . ".gif";
        print "--\n\r";
        print "ID: ".$id."\n\r";
        print "ID2: ".$id2."\n\r";
        print "URL: ".$url."\n\r";
        print "1: ".$1."\n\r";
        print "--\n\r";
        getstore($url, $1 . $id2 . ".gif");
    }
}

score 1 · Accepted Answer

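The problem is in your regex. The (.*) is greedy, so it matches every character between style_medals/ and $id2.gif. When $id2 is 1 this is fine, but when $id2 is 2 it matches everything up to 2.gif, which includes the entire string from 1.gif.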

Try making (.*) non-greedy by adding the ? modifier: (.*?). This should fix your problem.
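To make the difference concrete, here is a minimal sketch run against made-up markup (the file names are assumptions for illustration, not taken from the real site). It also shows a caveat worth knowing: a match always starts at the leftmost <img, so a non-greedy (.*?) can still stretch across tags when the first image does not end in $id2.gif. A character class such as [^"]* avoids that case.

```perl
use strict;
use warnings;

# Hypothetical markup, made up for illustration (not taken from the real site).
my $base = 'http://www.gamereplays.org/community/style_medals/';
my $html = join '', map { qq{<img src="$base$_" alt=""/>} }
    qw(gold_1.gif silver_1.gif gold_2.gif);

my $id2 = 1;

# Greedy: (.*) runs to the LAST "1.gif" in the string, swallowing a whole tag.
my ($greedy) = $html =~ m{<img src="\Q$base\E(.*)$id2\.gif" alt=""/>};

# Non-greedy: (.*?) stops at the FIRST "1.gif" it can reach.
my ($lazy) = $html =~ m{<img src="\Q$base\E(.*?)$id2\.gif" alt=""/>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";

# Caveat: the match still begins at the leftmost <img, so when the first
# image does not end in "$id2.gif", even (.*?) stretches across tags.
$id2 = 2;
my ($spanning) = $html =~ m{<img src="\Q$base\E(.*?)$id2\.gif" alt=""/>};
print "lazy, id2=2: $spanning\n";
```

With $id2 set to 1 the non-greedy capture is just the "gold_" prefix, while the greedy capture ends in "silver_" and contains a complete tag in between.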

Edit: Ideally you would not use a regex to parse HTML at all, but something like HTML::Parser instead.

score 1 · Accepted Answer

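As others have said, this really is a job for HTML::Parser. Also, you should "use strict;". Keep "use LWP::Simple", though: getstore() is exported by that module, so the download call depends on it.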

You could change your regex to the following:

$content =~ m{http://www\.gamereplays\.org/community/style_medals/([\w\_]+)$id2\.gif}s;

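However, you would then miss style_medals/comp_graphics_10.gif, which is probably a file you want. I think something like the following would work better. I apologize for the change in style, but I could not resist reworking it along PBP (Perl Best Practices) lines.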

#!/usr/bin/perl

use strict;
use warnings;
use Carp;
use LWP::Simple qw(getstore);   # getstore() is exported by LWP::Simple
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

# Fetch pages 1 to 54 (the loop stops before 55).  Are we sure there are
# no more pages?  Perhaps consider running until a 404 is found.
for (my $id = 1; $id < 55; $id++) {

    # Get the page data                                                         
    my $response = $ua->get( 'http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID='.$id );

    # Check for failure and abort                                               
    if (!defined $response || !$response->is_success) {
        croak 'Request failed! '.$response->status_line();
    }

    my $content = $response->content();

    # Run this loop each time we find the url                                   
  CONTENT_LOOP:
    while ($content =~ s{<img src="(http://www\.gamereplays\.org/community/style_medals/([^\"]+))" }{}ms) {

        my $url   = $1;  # The entire url, no need to recreate the domain       
        my $file  = $2;  # Just the file name portion                           
        my ($id2) = $file =~ m{ _(\d+)\.gif \Z}xms; # extract id2 for debug     

        next CONTENT_LOOP if !defined $id2;         # Handle SOTW.gif file(s)   

        # Display stats about each id found                                     
        print "--\n";
        print "ID:  $id\n";
        print "ID2: $id2\n";
        print "URL: $url\n";
        print "1:   $file\n";
        print "--\n";

        # You might want to consider involving the $id in the filename as       
        # you could have the same filename on multiple pages                    
        getstore( $url, $file);
    }
}
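The comments in the code above suggest including $id in the file name, since the same image name can appear on several pages. A minimal sketch of that idea follows. The helper name local_name and the "pageNN_" prefix are assumptions for illustration, and the sample markup is made up. It also shows a /g match in list context as one possible alternative to consuming $content with s{}{}.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): prefix the page id so
# the same image name on two pages does not overwrite an earlier download.
sub local_name {
    my ($id, $file) = @_;
    return sprintf 'page%02d_%s', $id, $file;
}

# Made-up sample markup standing in for $response->content().
my $content = <<'HTML';
<img src="http://www.gamereplays.org/community/style_medals/comp_graphics_1.gif" alt=""/>
<img src="http://www.gamereplays.org/community/style_medals/comp_graphics_10.gif" alt=""/>
<img src="http://www.gamereplays.org/community/style_medals/SOTW.gif" alt=""/>
HTML

# In list context, a /g match returns every captured file name at once.
my @files = $content =~ m{<img src="http://www\.gamereplays\.org/community/style_medals/([^"]+)"}g;

# Keep only names ending in _<digits>.gif, as the loop above does.
my @numbered = grep { m{_\d+\.gif\z} } @files;

print local_name(7, $_), "\n" for @numbered;
```

In the real script, the second argument to getstore() would become local_name($id, $file) instead of $file.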

score 0 · Accepted Answer

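I'm not going to push HTML parsing modules (although LinkExtor can be your friend here...), because I understand the problems HTML parsers can bring: they often choke when the HTML is not properly valid, whereas a simple regex can get through anything, however bad, as long as you are looking for the right thing.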

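As CanSpice said above, (.*) is greedy. The non-greedy modifier will usually do what you want. Another option, however, is to leave it greedy but make sure it cannot grab anything outside the quoted src attribute of the image tag: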

/<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/([^"]*)$id2\.gif"[^>]*>/

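Note: I also changed it so that it does not care whether there is an alt attribute. I am not familiar with the site you are pulling content from, though.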
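Here is a small sketch of that pattern at work, using made-up markup (the file names and extra attributes are assumptions). It also checks whether the match succeeded before using $1, because Perl leaves $1 holding the previous capture when a match fails.

```perl
use strict;
use warnings;

# Made-up markup: the second tag has a non-empty alt and an extra attribute.
my $content = join '',
    '<img src="http://www.gamereplays.org/community/style_medals/gold_1.gif" alt=""/>',
    '<img src="http://www.gamereplays.org/community/style_medals/gold_2.gif" alt="Gold" width="32">';

my %prefix;
for my $id2 (1 .. 3) {
    # [^"]* cannot cross the closing quote, so the capture stays inside one src value.
    if ($content =~ m{<img src="http://www\.gamereplays\.org/community/style_medals/([^"]*)$id2\.gif"[^>]*>}) {
        $prefix{$id2} = $1;
        print "ID2 $id2: prefix '$1'\n";
    }
    else {
        print "ID2 $id2: no match\n";   # do not reuse a stale $1 here
    }
}
```

Both existing images yield the prefix "gold_", and the missing 3.gif is reported as no match instead of silently repeating the previous capture.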

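If it is generated code, it should be fine unless they change something on a massive scale. To guard against that contingency without using a proper HTML parser, you may want to write a mini-parser for image tags yourself. Extract the image tags into the keys of a hash (grab them with a regex like /<\s*(img\s+[^>]*)\s*>/); using a hash avoids duplicates. Then, for each key in the hash, read everything inside quotes into separate storage and replace the quoted values so that no whitespace remains inside the quotes. Next, split on whitespace into attributes: element 0 is the tag name, and the rest are attributes, which you split on = into name and value, retrieving the values you stored a moment ago. When an attribute has no value, treat it as "0E0", which keeps it true but effectively valueless.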
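The steps above can be sketched roughly as follows. This is only one reading of the description, not a definitive implementation: the function name parse_tag, the NUL-byte placeholder scheme, and the sample markup are all assumptions made for illustration, and a > inside a quoted value would still break the tag regex.

```perl
use strict;
use warnings;

# Parse the inside of one tag, e.g.: img src="a b.gif" alt="" ismap
# Returns the lowercased tag name and a hash reference of attributes.
sub parse_tag {
    my ($tag) = @_;
    my @stash;

    # Move each quoted value into @stash and leave a whitespace-free
    # placeholder (a NUL byte plus the stash index) in its place.
    $tag =~ s{(["'])(.*?)\1}{ push @stash, $2; "\0" . $#stash }gse;
    $tag =~ s{\s*/\s*\z}{};    # drop an XHTML-style trailing slash

    my ($name, @pairs) = split ' ', $tag;
    my %attr;
    for my $pair (@pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        if (!defined $value) {
            $value = '0E0';                # valueless attribute: true, numerically zero
        }
        elsif ($value =~ m{\A\0(\d+)\z}) {
            $value = $stash[$1];           # restore the stashed quoted value
        }
        $attr{ lc $key } = $value;
    }
    return (lc $name, \%attr);
}

# Made-up markup for illustration; note the duplicated first tag.
my $html = <<'HTML';
<img src="medals/gold 1.gif" alt=""/>
<img src="medals/gold 1.gif" alt=""/>
<IMG SRC='medals/silver_2.gif' width=32 ismap>
HTML

# Hash keys de-duplicate the tags.
my %tags;
$tags{$1} = 1 while $html =~ m{<\s*(img\s+[^>]*)\s*>}gi;

my %by_src;
for my $tag (sort keys %tags) {
    my ($name, $attr) = parse_tag($tag);
    $by_src{ $attr->{src} } = $attr;
    print "$name\n";
    print "  $_ = $attr->{$_}\n" for sort keys %$attr;
}
```

Stashing the quoted values first is what lets the later whitespace split work even when a file name contains a space, as in the first sample tag.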

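If it is hand-written code, however, you may be in for some nightmares, because many people are inconsistent in how they quote attribute values, if they quote them at all.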
