regex - Perl: Why does this web-scraper regex work inconsistently?

Question

I've run into another problem with the website I'm trying to scrape.

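Basically, I've stripped most of what I don't want out of the page content and, thanks to some help given here, managed to isolate the dates I wanted. Most of it seems to work fine, despite some initial trouble matching non-breaking spaces. However, I'm now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the price of a stock price index. The fields on each line are: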

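A name of arbitrary length, made up of characters from the Latin alphabet and sometimes a comma or ampersand, with no digits.
A number with two digits after the decimal point (the absolute value of the index).
A number with two digits after the decimal point (the change in value).
A number with two digits after the decimal point, followed by a percent sign (the percentage change in value).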

Here is an example string, before splitting: "Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "

The regex I'm using to split this line is:

$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

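It works sometimes but not others, and I can't work out why. (The doubled equals signs in the example output below are there to make the field splits easier to see.)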

Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%

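I thought the minus sign was the problem for indices whose price changed negatively, but sometimes it works despite the minus sign.

Q. Why does the final regex shown below fail to split the fields consistently?

Example code follows.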

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";

my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
    $date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;

$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
  s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom

# and here's the problem regex...
# try to split it:
$mystr =~
  s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

print $mystr;

score 2 · Accepted Answer

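It appears to be matching every other record.

My guess is that your records have a single \n between them, but your pattern both starts and ends with a \n. So the final \n of the first match consumes the \n that the second match needs in order to find the second record. The net result is that it picks up every other record.

You might be better off wrapping your pattern in ^ and $ (instead of \n and \n) and using the m option on the s///.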
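
A minimal sketch of that suggestion, run against four records copied from the question's sample rather than the live page. The variable names are illustrative only, and the input is assumed to be one record per line with a leading and trailing newline.

```perl
use strict;
use warnings;

# Records copied from the question's sample, one per line.
my $records = join "\n",
    'Fishery, Agriculture & Forestry243.45-1.91-0.78%',
    'Mining360.74-4.15-1.14%',
    'Construction465.36-1.01-0.22%',
    'Foods783.2511.281.46%';
$records = "\n$records\n";

# Original approach: \n on both sides, so adjacent matches compete for one newline.
(my $with_newlines = $records) =~
    s/\n(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/g;

# Suggested approach: ^ and $ under /m are zero-width, so no newline is consumed.
(my $with_anchors = $records) =~
    s/^(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)$/$1 == $2 == $3 == $4/mg;

print "With \\n on both sides:$with_newlines";
print "With ^ and \$ plus /m:$with_anchors";
```

On this input the first substitution should leave the Mining and Foods lines unsplit, as in the question, while the anchored version splits all four.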

score 2 · Accepted Answer

The problem is that you have \n at both the start and the end of the regex.

Consider something like this:

$s = 'abababa';
$s =~ s/aba/axa/g;

This sets $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
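
As a small sketch of the difference, the following contrasts the consuming pattern with a lookaround variant. The lookaround version is an illustration and not part of the original answer; applied to the question's regex, the same idea would presumably mean writing the trailing \n as (?=\n).

```perl
use strict;
use warnings;

my $consuming = 'abababa';
$consuming =~ s/aba/axa/g;            # each match consumes both of its a's

my $lookaround = 'abababa';
$lookaround =~ s/(?<=a)b(?=a)/x/g;    # the surrounding a's are asserted, never consumed

print "$consuming\n";   # axabaxa
print "$lookaround\n";  # axaxaxa
```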

score 0 · Accepted Answer

My interpretation (pseudocode) -

one   = [a-zA-Z,& ]+
two   = \d{1,4}\.\d\d
three = -?<<two>>
four  = -?<<two>>%

regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
      = ([a-zA-Z,& ]+)(\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
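
A runnable sketch of the interpretation above, applied directly to the unsplit sample string from the question (shortened to four records). It escapes the decimal point and allows an optional minus sign, which the sample data needs; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Unsplit sample from the question, shortened to four records.
my $line = 'Fishery, Agriculture & Forestry243.45-1.91-0.78% '
         . 'Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% '
         . 'Foods783.2511.281.46% ';

my $name = qr/[a-zA-Z,& ]+/;       # field one: letters, commas, ampersands, spaces
my $num  = qr/-?\d{1,4}\.\d\d/;    # fields two to four: optional sign, two decimals

my @rows;
while ($line =~ /($name)($num)($num)(${num}%)/g) {
    my ($industry, $index, $change, $percent) = ($1, $2, $3, $4);
    $industry =~ s/^\s+//;         # drop the space left over from the previous record
    push @rows, [$industry, $index, $change, $percent];
}

print join(' == ', @$_), "\n" for @rows;
```

Because the name class excludes digits, the name field stops at the first digit, so no newline delimiters are needed at all.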

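However, you already have "structured" data in the form of HTML. Why not take advantage of that?

The question "HTML parsing in perl" references MOJO for DOM-based parsing in Perl, and unless there are serious performance reasons, I would highly recommend such an approach.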
