I have an XML document like this:

  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>

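I need to parse it in Perl and then add new tags around certain words or phrases (for example, to link to definitions). I only want to mark the first instance of each target word, and I want to restrict the search to the content of a given tag (for example, only the description tag).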

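I can parse it with XML::Twig and set a "twig_handler" for the description tag. But when I call $node->text, I get the text with the intervening tags removed. What I really want is to walk the (very small) tree so that existing tags are preserved rather than clobbered. The final XML output should therefore look like this: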

  <description>Article about <b><a href="dictionary.html#frobnitz">frobnitz</a></b>, <a href="dictionary.html#crulps">crulps</a> and <a href="dictionary.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>



use strict;
use warnings;

use XML::Twig;

# target words and the URLs they should link to
my %dictionary = (
    frobnitz    => 'dictionary.html#frobnitz',
    crulps      => 'dictionary.html#crulps',
    furtikurty  => 'dictionary.html#furtikurty',
);

# wrap the first occurrence of each dictionary word in a link
sub markup_plain_text { 
    my ( $text ) = @_;

    foreach my $k ( keys %dictionary ) {
        $text =~ s/(^|\W)(\Q$k\E)(\W|$)/$1<a href="$dictionary{$k}">$2<\/a>$3/si;
    }

    return $text;
}

sub convert {
    my( $t, $node ) = @_;
    # problem: ->text returns the content with the inline <b> tags stripped
    warn "convert: TEXT=[" . $node->text . "]\n";
    $node->set_text( markup_plain_text($node->text) );
    return 1;
}

sub markup {
    my ( $text ) = @_;

    my $t = XML::Twig->new(
        twig_handlers => { description => \&convert },
        pretty_print  => 'indented',
    );
    $t->parse( $text );

    return $t->flush;
}

my $orig = <<END_XML;
    <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz's</description>
END_XML

markup($orig);


1 Answer


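This is a little tricky, but XML::Twig was designed for this kind of processing (and I use it a lot). There is a specific method for it, called mark, which takes a regexp and wraps each match in a new element.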
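As a minimal sketch of what mark does on its own, assuming a shortened version of the description from the question (the variable names and the two-word pattern are illustrative, not part of the script in this answer):

```perl
use strict;
use warnings;
use XML::Twig;

# shortened sample input, for illustration only
my $xml = q{<description>Article about <b>frobnitz</b>, crulps. Mainly frobnitz</description>};

my $twig = XML::Twig->new;
$twig->parse($xml);

# mark() wraps the text captured by the regexp in a new <a> element,
# including text inside child elements such as <b>,
# and returns the elements it created
my @links = $twig->root->mark( qr/(\b(?:frobnitz|crulps)\b)/, 'a' );

print scalar(@links), " elements created\n";
print $twig->sprint, "\n";
```

Note that every match is wrapped at this point, including the second frobnitz, and that the new elements have no attributes yet.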

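In this case the regexp can get pretty big. I used Regexp::Assemble to build it, so it is optimized. The other problem is that mark does not let you use the matched text to set attributes (I may address this in the next version of the module; it would be useful), so I have to mark first, then go back and set the href attribute in a second pass (that second pass is needed in any case, to untag words that have already been linked).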
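To make the regexp-building step concrete, here is a small sketch that runs Regexp::Assemble on the three dictionary words by themselves. The variable names are illustrative, and the exact pattern string printed depends on the module's optimizer, so no expected output is shown:

```perl
use strict;
use warnings;
use Regexp::Assemble;

# the three words from the dictionary, hard-coded for illustration
my $ra = Regexp::Assemble->new;
$ra->add( qw(frobnitz crulps furtikurty) );

# as_string returns the assembled pattern as a plain string
my $pattern = $ra->as_string;
print "assembled pattern: $pattern\n";

# wrap it in capturing parentheses, which mark() needs
my $re = qr/($pattern)/;

for my $word ( qw(frobnitz crulps furtikurty gizmo) ) {
    printf "%-10s %s\n", $word, ( $word =~ /\A$re\z/ ? 'matches' : 'no match' );
}
```

The assembled pattern matches the same words as a plain alternation would; the point of the module is that shared prefixes are factored out, which matters once the dictionary grows.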



use strict;
use warnings;

use XML::Twig;
use Regexp::Assemble;

use Test::More tests => 1; 
use autodie qw(open);

my %dictionary = (
    frobnitz    => 'definitions.html#frobnitz',
    crulps      => 'definitions.html#crulps',
    furtikurty  => 'definitions.html#furtikurty',
    );

my $match_defs= Regexp::Assemble->new()
                                ->add( keys %dictionary)
                                ->anchor_word
                                ->as_string;
# I am not familiar enough with Regexp::Assemble to know a cleaner
# way to get the capturing parentheses in the regexp
$match_defs= qr/($match_defs)/; 

my $in       = data_para(); 
my $expected = data_para();
my $out;
open( my $out_fh, '>', \$out);

XML::Twig->new( twig_roots => { 'description' => sub { tag_defs( @_, $out_fh, $match_defs, \%dictionary); } },
                twig_print_outside_roots => $out_fh, 
              )
         ->parse( $in);

is( $out, $expected, 'base test');

sub tag_defs
  { my( $t, $description, $out_fh, $match_defs, $dictionary)= @_;

    # first pass: wrap every match in an <a> element
    my @a= $description->mark( $match_defs, 'a' );

    # word => 1 when already used in this description
    # this might need to have a different scope if you need to tag
    # only the first time the word appears in a section or whatever
    my $tagged_in_description= {}; 

    # second pass: set href on the first occurrence, untag the others
    foreach my $a (@a) 
      { my $word= $a->text;
        warn "checking a: ", $a->sprint, "\n";

        if( $tagged_in_description->{$word})
          { $a->erase; } # we did not need to tag it after all
        else
          { $a->set_att( href => $dictionary->{$word}); }
        $tagged_in_description->{$word}++;
      }

    $t->flush( $out_fh); }

sub def_href
  { my( $word)= @_;
    return $dictionary{$word};
  }

# read one blank-line-separated paragraph from the DATA section
sub data_para
  { local $/="\n\n";
    my $para= <DATA>;
    return $para;
  }

__DATA__
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>

  <description>Article about <b><a href="definitions.html#frobnitz">frobnitz</a></b>, <a href="definitions.html#crulps">crulps</a> and <a href="definitions.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
Answered 2011-05-12T08:19:00.273