c# - Parsing Dreamweaver templates with regular expressions

Question

I need to parse content out of Dreamweaver templates. I'm using C#.

Here is some sample content that I need to parse.

<div id="myDiv">
    <h1><!-- InstanceBeginEditable name="PageHeading" -->
    The Heading<!-- InstanceEndEditable --></h1>
    <!-- InstanceBeginEditable name="PageContent" -->
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed nibh turpis, 
    sagittis vitae convallis at, fringilla nec augue.</p>
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    Sed nibh turpis, sagittis vitae convallis at, fringilla nec augue.</p>
    <!-- InstanceEndEditable -->
</div><!-- END #myDiv-->

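Dreamweaver templates are based on HTML comments containing specific strings that denote their purpose. The ones that matter to me are the following, because they mark the beginning and end of the editable regions in the page.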

<!-- InstanceBeginEditable name="xxxxxx" -->
<!-- InstanceEndEditable -->

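As you can see from my sample HTML, there may be other comments in the source.

So, starting simple, I have the following, which matches all the opening editable-region tags.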

<!-- InstanceBeginEditable(.*)?-->

So next I want to capture everything between that opening tag and the next end marker:

<!-- InstanceBeginEditable(.*)?-->(?<content>(.*)?)<!-- InstanceEnd

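Can you tell me why this doesn't work? I would have thought that the non-greedy capture (.*)? between the pattern I already have working and the literal

<!-- InstanceEnd

would match what I need...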

score 1 · Accepted Answer

You don't want parentheses around the .*.

This means grab everything greedily, or nothing at all:

(.*)?

This means grab everything lazily:

.*?
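To see the difference concretely, here is a minimal sketch that runs both forms against the same input. The class name GreedyVsLazyDemo, the helper Capture, and the shortened markers `<!-- A -->` and `<!-- End` are made up for illustration; they are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text captured by the named group
    // "content", or an empty string if the pattern does not match.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["content"].Value : string.Empty;
    }

    public static void Main()
    {
        // Two regions back to back, using shortened stand-in markers.
        string input = "<!-- A -->one<!-- End --><!-- A -->two<!-- End -->";

        // (.*)? is a greedy group made optional, so it runs on to the LAST end marker.
        Console.WriteLine(Capture(@"<!-- A -->(?<content>(.*)?)<!-- End", input));

        // .*? is lazy, so it stops at the FIRST end marker.
        Console.WriteLine(Capture(@"<!-- A -->(?<content>.*?)<!-- End", input));
    }
}
```

The first call should swallow both regions in a single capture, while the second should capture only the first region.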

Also, in your regex, the closing tag has only one -. Change it to:

<!-- InstanceBeginEditable.*?-->(?<content>.*?)<!-- InstanceEnd

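By the way, having two .*s in a regex without atomic groups is dangerous. On unexpected data, you can get catastrophic backtracking. I'd suggest changing the first .*? to [^-]*. And while I'm at it, I'd suggest being more lenient with whitespace: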

<!--\s*InstanceBeginEditable[^-]*-->(?<content>.*?)<!--\s*InstanceEnd

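You probably know this already, but let me add that in .NET you need to use RegexOptions.Singleline, so that . also matches newlines and the content can span several lines.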
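Putting it together, the pattern could be used roughly like this. This is only a sketch under some assumptions: the class EditableRegionParser and the method Parse are hypothetical names, and the (?<name>...) group that captures the region name is an addition that is not part of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EditableRegionParser
{
    // The pattern from the answer, plus an assumed (?<name>...) group for the
    // region name. Singleline lets . match newlines inside a region.
    private static readonly Regex RegionPattern = new Regex(
        @"<!--\s*InstanceBeginEditable\s+name=""(?<name>[^""]*)""[^-]*-->(?<content>.*?)<!--\s*InstanceEnd",
        RegexOptions.Singleline);

    // Hypothetical helper: maps each region name to its trimmed content.
    public static Dictionary<string, string> Parse(string html)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match m in RegionPattern.Matches(html))
        {
            regions[m.Groups["name"].Value] = m.Groups["content"].Value.Trim();
        }
        return regions;
    }

    public static void Main()
    {
        // Shortened version of the sample markup from the question.
        string html =
            "<div id=\"myDiv\">\n" +
            "    <h1><!-- InstanceBeginEditable name=\"PageHeading\" -->\n" +
            "    The Heading<!-- InstanceEndEditable --></h1>\n" +
            "    <!-- InstanceBeginEditable name=\"PageContent\" -->\n" +
            "    <p>Lorem ipsum</p>\n" +
            "    <!-- InstanceEndEditable -->\n" +
            "</div><!-- END #myDiv-->";

        foreach (KeyValuePair<string, string> region in Parse(html))
        {
            Console.WriteLine(region.Key + ": " + region.Value);
        }
    }
}
```

Note that the trailing `<!-- END #myDiv-->` comment is ignored, because only comments that start with InstanceBeginEditable open a match.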

score 0 · Accepted Answer

Use HTML Agility Pack; see my answer to "How do I parse HTML using regular expressions in C#?"
