.net - Using regular expressions to wrap parts of text with XML tags

Question

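We are developing an internal tool that generates documentation for our .NET products.

As part of its functionality, we need to wrap plain paragraphs in <para> tags.

Here a "plain paragraph" means a single line of text, possibly containing some XML-like inline tags, that is not inside another block tag such as <cell> or <description>.

Sample source file: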

Description paragraph #1.
Description paragraph #2.
<code>
Method1();
Method2();
</code>
<list type="number">
  <item>
    <description>
      If you need to do something, use the <see cref="P:foo1" /> method.
    </description>
  </item>
  <item>
    <description> The <see cref="P:foo2" /> method does this.
The <see cref="P:foo3" /> method does that.</description>
  </item>
</list>

<section>
<title>Section title</title>
<content>
Section paragraph #1.
Section paragraph #2.
</content>
</section>

This should be converted to the following:

<para>Description paragraph #1.</para>
<para>Description paragraph #2.</para>
<code>
Method1();
Method2();
</code>
<list type="number">
  <item>
    <description>
      If you need to do something, use the <see cref="P:foo1" /> method.
    </description>
  </item>
  <item>
    <description> The <see cref="P:foo2" /> method does this.
The <see cref="P:foo3" /> method does that.</description>
  </item>
</list>

<section>
<title>Section title</title>
<content>
<para>Section paragraph #1.</para>
<para>Section paragraph #2.</para>
</content>
</section>

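Formally, the task is this: wrap every line of text in <para>..</para>, but only if it is not inside one of a limited list of other tags. Whitespace such as CR/LF, tabs, and spaces may appear around any future paragraph that sits inside a tag.

Obviously, regular expressions should be used for this, but we have not managed to build one that fits this case. Any ideas or hints?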

score 1 · Accepted Answer

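You say "obviously, regular expressions should be used". Many people would say you are missing a "not" in that assertion. See this well-known answer.

If you are sure the outer tags are never nested, you could probably match with some horrible regex like: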

(<list([^<]|<(?!/list))+</list>)|(<code([^<]|<(?!/code))+</code>)|([^\n]+)

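and replace only those matches that are the non-tag part. But really, why don't you use one of the many XML parsers and simply replace the appropriate text nodes?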
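A minimal sketch of how that pattern might be applied, written in Python for brevity (in .NET the same idea maps onto Regex.Replace with a MatchEvaluator). The name wrap_plain_lines is made up for illustration. Only matches from the last alternative (a plain line) are wrapped; <list> and <code> blocks are returned untouched. The pattern protects only those two tags, so the <section>/<content> part of the question's sample would need further alternatives, and with CR/LF input the trailing \r would end up inside the <para>.

```python
import re

# The pattern from the answer, unchanged: a whole <list> block,
# a whole <code> block, or one line of anything else.
PATTERN = re.compile(
    r"(<list([^<]|<(?!/list))+</list>)"
    r"|(<code([^<]|<(?!/code))+</code>)"
    r"|([^\n]+)"
)


def wrap_plain_lines(text):
    """Wrap lines outside <list>/<code> blocks in <para> (hypothetical helper)."""

    def replace(match):
        line = match.group(5)  # group 5 is the "plain line" alternative
        if line is None or not line.strip():
            return match.group(0)  # a block tag or whitespace-only line: keep as-is
        return "<para>" + line + "</para>"

    return PATTERN.sub(replace, text)


sample = (
    "Description paragraph #1.\n"
    "Description paragraph #2.\n"
    "<code>\nMethod1();\nMethod2();\n</code>\n"
)
print(wrap_plain_lines(sample))
```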

score 0 · Accepted Answer

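It is hard to infer the full requirements from your example, but if the example is typical, then once the supplied content has been wrapped in a <wrapper> element to make it well-formed, the following XSLT 2.0 stylesheet will do the job: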

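<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every element child of the wrapper unchanged. -->
  <xsl:template match="/wrapper/*">
    <xsl:copy-of select="."/>
  </xsl:template>

  <!-- Wrap each non-blank line of top-level text in <para>. -->
  <xsl:template match="/wrapper/text()">
    <xsl:for-each select="tokenize(., '\n')[normalize-space(.) != '']">
      <para><xsl:copy-of select="."/></para>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>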
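For anyone without an XSLT 2.0 processor at hand, the same wrapper-element idea can be sketched with an ordinary XML parser. What follows is an illustration in Python, not a finished tool. The sets CONTAINERS and INLINE and all function names are assumptions: the question does not give the full tag lists, so they would have to be adjusted. Unlike the stylesheet, this sketch also descends into <content> and keeps inline tags such as <see> inside the paragraph they belong to.

```python
import xml.etree.ElementTree as ET

CONTAINERS = {"wrapper", "content"}  # assumption: paragraphs may sit directly in these
INLINE = {"see"}                     # assumption: tags allowed inside a paragraph


def _new_para(container):
    para = ET.SubElement(container, "para")
    para.tail = "\n"
    return para


def _append_text(para, text):
    if len(para):  # text after an inline child is stored as that child's tail
        para[-1].tail = (para[-1].tail or "") + text
    else:
        para.text = (para.text or "") + text


def _rebuild(container):
    """Re-create the container's mixed content, one <para> per text line."""
    items = [container.text or ""]
    for child in list(container):
        items += [child, child.tail or ""]
        child.tail = None
        container.remove(child)
    container.text = "\n"

    para = None  # paragraph currently being filled, if any
    for item in items:
        if isinstance(item, str):
            for i, line in enumerate(item.split("\n")):
                if i > 0:
                    para = None  # a line break closes the open paragraph
                if para is None and not line.strip():
                    continue  # skip blank lines between blocks
                if para is None:
                    para = _new_para(container)
                _append_text(para, line)
        elif item.tag in INLINE:
            if para is None:
                para = _new_para(container)
            para.append(item)
        else:  # block element: keep it as-is, outside any paragraph
            para = None
            item.tail = "\n"
            container.append(item)


def wrap_paragraphs(source):
    root = ET.fromstring("<wrapper>" + source + "</wrapper>")
    for element in list(root.iter()):  # snapshot taken before the tree is modified
        if element.tag in CONTAINERS:
            _rebuild(element)
    xml = ET.tostring(root, encoding="unicode")
    return xml[len("<wrapper>"):-len("</wrapper>")]


sample = (
    'Intro line.\nUse <see cref="P:foo" /> here.\n'
    "<code>\nMethod1();\n</code>\n"
    "<section>\n<title>T</title>\n<content>\nA.\nB.\n</content>\n</section>\n"
)
print(wrap_paragraphs(sample))
```

Because the parser normalizes line endings, CR/LF input needs no special handling here, but the source must be well-formed once wrapped, or parsing fails.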
