regex - Why doesn't my regular expression find the number

Question

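I need to extract a string from HTML code. I have a regex. After opening the file (or after making a "get" request), I need to find the pattern.

So, I have some HTML code, and I want to find a string like this: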

<input type="hidden" name="qid" ... anything is possible bla="blabla" ... value="8">

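I want to find the string qid, then find the string value="435345" after it and extract 435345.

For now I am only trying to find this string (I have done that part), and then I will do the replacement (I will handle that), but this code does not find the pattern. What is wrong?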

open(URLS_OUT, $foundResults);
@lines = <URLS_OUT>;
$content = join('', @lines);

$content =~ /<qid\"\s*value=[^>][0-9]+/;
print 'Yes'.$1.'\n';

close(URLS_OUT);

Or this code:

my $content = $response->content(); 

while ($content =~ /<qid\"\s*value=[^>][0-9]+/g)
    {
        print 'Yes'.$1.'\n';
    }

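I have checked that the file is not empty and is opened correctly (I printed it out), but my program does not find the pattern. What is wrong? I checked the regex using this reference (and a few others): http://gskinner.com/RegExr/ It shows that the regex is correct and finds what I need.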

score 3 · Accepted Answer

Update your regex like this:

/<qid\"\s*value=([^>][0-9]+)/

That is, add "(" and ")" so that the matched data is captured into $1.
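A minimal sketch of what the parentheses change, using a simplified pattern rather than the one above. The helper name `capture_digits` and the sample strings are assumptions made for illustration only. It also shows why the match result should be checked before reading $1: after a failed match, $1 does not hold anything useful from that attempt.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured digits, or undef when nothing matches.
sub capture_digits {
    my ($text) = @_;
    # The parentheses form capture group 1; on a successful match its text is in $1.
    if ($text =~ /value="([0-9]+)"/) {
        return $1;
    }
    return;    # no match, so do not trust $1
}

my $with    = capture_digits('name="qid" value="435345"');
my $without = capture_digits('name="qid"');

print "captured: $with\n";
print "no match\n" unless defined $without;
```

Note that a capture group only helps once the whole pattern matches the input, so the literal parts of the pattern (such as `<qid"`) still have to occur in the HTML exactly as written.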

score 3 · Accepted Answer

Your idea of how:

$content =~ /<qid\"\s*value=[^>][0-9]+/;

works is wrong. Please learn basic regex usage in Perl.

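By the way: you should not parse HTML with regular expressions. There are plenty of examples online of how to do this properly. Look them up!

For learning purposes, your regex would look like this (based on your comments):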

my $content = q{
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="98">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="788">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="128">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8123">
};
my $regex = qr{ name=     # find the attribute 'name'
                "qid"     # with a content of "qid"
                .+?       # now search along until the next 'value'
                value=    # the following attribute 'value' 
                "(\d+)    # find the number and capture it
              }x;   ## allow the regex to be formatted   

while( $content =~ /$regex/g ) { # /g - search along
   print "Yes $1 \n"
}
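As a variation on the loop above, a /g match in list context returns all captured values at once, which can be handy when the numbers are needed as a list. This is only a sketch: the helper name `qid_values` and the two sample lines are assumptions, and the pattern is written on one line without /x.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every qid value found in $html, in order.
sub qid_values {
    my ($html) = @_;
    # In list context, a /g match returns the text of group 1 for each match.
    # Without /s, "." does not match a newline, so each match stays on one line.
    my @values = $html =~ /name="qid".+?value="(\d+)"/g;
    return @values;
}

my $html = join "\n",
    '<input type="hidden" name="qid" bla="blabla" value="8">',
    '<input type="hidden" name="qid" bla="blabla" value="98">';

print "Yes $_\n" for qid_values($html);
```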

Once this works, look into how to read the content with HTML-Parser.

score 3 · Accepted Answer

Use HTML::Parser to handle messy real-world HTML.

#! /usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

sub start {
  my($attr,$attrseq) = @_;
  while (defined(my $name = shift @$attrseq)) {  # first ...="qid"
    last if $attr->{$name} eq "qid";
  }
  while (defined(my $name = shift @$attrseq)) {  # then value="<num>"
    if ($name eq "value" && $attr->{$name} =~ /\A[0-9]+\z/) {
      print "Yes", $attr->{$name}, "\n";
    }
  }
}

my $p = HTML::Parser->new(
  api_version => 3,
  start_h => [\&start, "attr, attrseq"],
);
$p->parse_file(*DATA);

__DATA__
<input type="hidden" name="qid" value="8">
<input type="hidden" name="qidx" value="000000">
<foo type="hidden" name="qid" value="9">
<foo type="hidden" name="qid" value="000000x">
<foo type="hidden" name="QID" value="000000">
<bar type="hidden" NAME="qid" value="10">
<baz type="hidden" name="qid" VALUE="11">
<quux type="hidden" NAME="qid" VALUE="12">

Output:

Yes8
Yes9
Yes10
Yes11
Yes12

score 2 · Accepted Answer

For $1 to contain a value, you need to use a capture group. Try:

$content =~ /<qid\"\s*value=([^>][0-9]+)/;

score 0 · Accepted Answer

For the example you provided, your regex should look like this:

$content =~ m{
               \"       # match a double quote
               qid      # match the string: qid
               \"       # match a double quote
               [^>]*    # match anything but the closing >
               value    # match the string: value
               \=       # match an equal sign
               \"       # match a double quote
               (\d+)    # capture a string of digits
               \"       # match a double quote
             }msx;
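A short sketch of how the `[^>]*` part behaves, assuming a condensed form of the pattern above wrapped in a hypothetical helper named `qid_value`. Because `[^>]*` cannot step over a closing `>`, the `value` attribute has to sit in the same tag as `"qid"`; a value in a later tag is not picked up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number from a tag containing "qid", or undef.
sub qid_value {
    my ($html) = @_;
    # Same idea as the pattern above: "qid", then anything except ">", then value="digits".
    return $html =~ m{ "qid" [^>]* value = "(\d+)" }x ? $1 : undef;
}

my $same_tag  = qid_value('<input type="hidden" name="qid" bla="blabla" value="8">');
my $other_tag = qid_value('<input name="qid"><input name="other" value="9">');

print "same tag: $same_tag\n";
print "other tag: no match\n" unless defined $other_tag;
```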
