
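I have a set of HTML files with illegal syntax in the attributes of tags such as <a>: the attribute values contain unescaped double quotes. For example,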

<a name="Conductor, "neutral""></a>

or

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />

or

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>

I am trying to process the files with parsefile_html($file_name) from Perl's XML::Twig module. When it reads a file that has this syntax, it gives the following error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893

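What I need is either a way to make the module accept the bad syntax and deal with it, or a regular expression that finds the double quotes inside attribute values and replaces them with single quotes.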


2 Answers


Given your HTML examples, the following code works:

use Modern::Perl;

my $html = <<end;
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
end

$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;

say $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>

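My concern is that variable-length lookbehind is not implemented, so the patterns will fail to match if there is any whitespace before or after the equals sign. However, the pages were most likely generated consistently, in which case the match will not fail.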
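If whitespace around the equals sign does turn up, one possible workaround is `\K` (Perl 5.10 and later), which keeps a variable-width prefix out of the replaced text without needing a lookbehind. The following is only a sketch under stated assumptions: `strip_inner_quotes` is a hypothetical helper name, it uses `tr///r` (Perl 5.14 and later), and like the patterns above it assumes the named attribute is the last one in its tag and that the tag sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): removes stray double
# quotes from the value of one named attribute, allowing optional
# whitespace around the equals sign.
# Assumption: the attribute is the LAST one in its tag; otherwise the lazy
# match would run on into later attributes and strip their quotes too.
sub strip_inner_quotes {
    my ($html, $attr) = @_;

    # \K drops everything matched so far from the text being replaced, so
    # the variable-width prefix (name, spaces, '=', spaces, quote) is kept.
    $html =~ s{ \b\Q$attr\E \s* = \s* " \K (.*?) (?= " \s* /? > ) }
              { $1 =~ tr/"//dr }egx;

    return $html;
}

my $sample = qq{<a name = "Conductor, "neutral""></a>\n};
print strip_inner_quotes($sample, 'name');
```

Calling it once per attribute name (`content`, then `name`) mirrors the two substitutions above.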

Of course, try the substitutions on copies of the files first.

answered 2012-05-16T04:09:32.240

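The only reasonably safe way I can think of to do this is with two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.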

The outer substitution finds every tag in the string and replaces it with a copy of the tag whose attribute values have been adjusted.

The inner substitution finds every attribute value within the tag and replaces it with the same value with all double quotes removed.

use strict;
use warnings;

my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

$html =~ s{(<[^>]+>)}{

  my $tag = $1;

  $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
  {
    (my $attr = $1) =~ tr/"//d;
    $attr;
  }egx;

  $tag;
}eg;

print $html;

Output

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
<a href="1.html" title="Page 1: What are series and parallel circuits?">
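The question also asked about turning the inner double quotes into single quotes rather than deleting them. A possible variant of the same nested substitution is sketched below. The `requote_attributes` wrapper is a hypothetical name added for illustration, and `tr///r` assumes Perl 5.14 or later. The patterns are otherwise unchanged.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the nested substitution shown above. The only
# change is that inner double quotes become single quotes instead of being
# deleted.
sub requote_attributes {
    my ($html) = @_;

    # Outer substitution: visit each tag in turn.
    $html =~ s{(<[^>]+>)}{
        my $tag = $1;

        # Inner substitution: rewrite each attribute value inside the tag.
        $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
                 { $1 =~ tr/"/'/r }egx;

        $tag;
    }eg;

    return $html;
}

print requote_attributes(
    qq{<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">\n}
);
```

As with the original, a value that itself contains `=`, `<` or `>` is not matched by the inner pattern, so it is worth running this on copies of the files first.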
answered 2012-05-16T14:11:10.853