I am trying to convert DOCX to DITA topics via an intermediate HTML step.

Simple substitutions in sed, emacs, or vi handle most of the changes, but not certain kinds. For those I probably need Perl or Python. Here is an example of what I want to accomplish:

From:

<h1> Head 1 </h1>
  <body> 
  </body>


 <h2>Sub Head 1 </h2>
  <body>
  </body>


  <h3>SubSub Head 1 </h3>
   <body> 
   </body>

 <h2>Sub Head 2 </h2>
 <body> 
 </body>

<h1>Head 2 </h1>
<body> 
</body>

To:

<topic><title> Head 1 </title>
  <body> 
  </body>

 <topic><title> Sub Head 1 </title>
  <body>
  </body>

  <topic><title> SubSub Head 1 </title>
   <body> 
   </body>
  </topic>
 </topic>

 <topic><title> Sub Head 2 </title>
 <body> 
 </body>
 </topic>
</topic>

<topic><title> Head 2 </title>
<body> 
</body>
</topic>

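The part I am having trouble with is placing the tags for nested topics (yes, I really do have nested topics; my requirements are somewhat unusual because I am migrating existing documentation). If someone could suggest a Perl snippet, or a pointer to a similar one, that places the topic tags based on each heading tag, I could build my script around it.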

Thanks in advance for your attention and suggestions.

1 Answer

This is the kind of processing I do with XML::Twig all the time.

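The wrap_children method is designed for exactly this: it lets you write a regexp-like expression over an element's children, and each run of children that matches is wrapped in a new element. See the example below and the documentation for more information: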

#!/usr/bin/perl

use strict;
use warnings;

use Test::More tests => 1;

use XML::Twig;

# reads the DATA section, the input doc first, then the expected result
my( $in, $expected)= do{ local $/="\n\n"; <DATA>}; 

my $t=XML::Twig->new->parse( $in);
my $root= $t->root;

# that's where the wrapping occurs, from the inside out
$root->wrap_children( '<h3><body>',                   topic => { level => 3 });
$root->wrap_children( '<h2><body><topic level="3">*', topic => { level => 2 });
$root->wrap_children( '<h1><body><topic level="2">*', topic => { level => 1 });

# now we cleanup: the levels are not used any more
foreach my $to ($t->descendants( 'topic'))
  { $to->del_att( 'level'); }

# the wrapping will have generated tons of additional id's, 
# you may not need this if your elements had id's before the wrapping
foreach my $to ($t->descendants( 'topic|body|h1|h2|h3'))  
  { $to->del_att( 'id'); }

# now we can deal with titles
foreach my $h  ($t->descendants( 'h1|h2|h3')) { $h->set_tag( 'title'); }

# how did we do?
is( $t->sprint( pretty_print => 'indented'), $expected, 'just one test');

__DATA__
<doc>
  <h1> Head 1 </h1>
    <body></body>
  <h2> Sub Head 1 </h2>
    <body></body>
  <h3> SubSub Head 1 </h3>
    <body></body>
  <h2> Sub Head 2 </h2>
    <body></body>
  <h1> Head 2 </h1>
    <body></body>
</doc>

<doc>
  <topic>
    <title> Head 1 </title>
    <body></body>
    <topic>
      <title> Sub Head 1 </title>
      <body></body>
      <topic>
        <title> SubSub Head 1 </title>
        <body></body>
      </topic>
    </topic>
    <topic>
      <title> Sub Head 2 </title>
      <body></body>
    </topic>
  </topic>
  <topic>
    <title> Head 2 </title>
    <body></body>
  </topic>
</doc>
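
Since the question mentions Python as an option, here is a rough sketch of the same nesting done with a stack instead of wrap_children. It is not the XML::Twig approach above, only an alternative under some assumptions: the input is well-formed XML, the headings and bodies are direct children of a single wrapper element (the `<doc>` root is borrowed from the test data above), and the names `nest_topics`, `HEADING`, and `SAMPLE` are made up for illustration.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Matches h1..h6 tag names (hypothetical helper, not part of any library).
HEADING = re.compile(r"^h([1-6])$", re.IGNORECASE)


def nest_topics(flat_root):
    """Build a new tree where each heading opens a <topic> nested by level."""
    new_root = ET.Element(flat_root.tag, dict(flat_root.attrib))
    # Stack of (heading level, open element); level 0 is the wrapper root.
    stack = [(0, new_root)]
    for child in flat_root:
        node = copy.deepcopy(child)  # leave the input tree untouched
        node.tail = None             # drop the source indentation
        match = HEADING.match(node.tag)
        if match:
            level = int(match.group(1))
            # Close every open topic at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            topic = ET.SubElement(stack[-1][1], "topic")
            node.tag = "title"       # the heading becomes the topic title
            topic.append(node)
            stack.append((level, topic))
        else:
            # Anything that is not a heading belongs to the innermost topic.
            stack[-1][1].append(node)
    return new_root


SAMPLE = """<doc>
  <h1> Head 1 </h1>
    <body></body>
  <h2> Sub Head 1 </h2>
    <body></body>
  <h3> SubSub Head 1 </h3>
    <body></body>
  <h2> Sub Head 2 </h2>
    <body></body>
  <h1> Head 2 </h1>
    <body></body>
</doc>"""

result = nest_topics(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

The stack holds the currently open topics, so a heading at level N first closes everything at level N or deeper and then opens a new topic under whatever remains on top. Unlike the three wrap_children calls, this does not need one rule per heading level, but it does assume the HTML has already been cleaned up into well-formed XML.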
Answered 2014-02-06T17:22:47.533