c# - Remove the content between h2 tags (including the h2 tags)

Question

I am trying to use a regular expression in C# to strip the content between h2 tags from a string:

<h2>content needs removing</h2> other content...

I have the following regex, which should work according to RegexBuddy (the tool I use to test it), but it doesn't:

myString = Regex.Replace(myString, @"<h[0-9]>.*</h[0-9]>", String.Empty);

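I have another regex that runs after this one to remove all other HTML tags; it is called the same way and works fine. Can anyone help me work out why this one doesn't?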

score 4 · Accepted Answer

Don't use a regex.

HTML is not a regular language, so it cannot be parsed correctly with regular expressions.

For example, your regex will match:

<h2>sample</h1>

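which is invalid. With nested structures this leads to unexpected results (.* is greedy and matches everything up to the last closing h[0-9] tag in the input HTML string).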
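To make the mismatched-tag point concrete, here is a minimal sketch. The class name MismatchedTagDemo and the Paired pattern are illustrative assumptions, not part of the original question; Paired uses a backreference so that the closing tag has to repeat the heading level captured from the opening tag. It only addresses the level mismatch and does nothing for nesting.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original thread.
public static class MismatchedTagDemo
{
    // Pattern from the question: the opening and closing digits are independent.
    public const string Loose = @"<h[0-9]>.*</h[0-9]>";

    // Assumed variant: \1 must equal the digit captured by ([0-9]).
    public const string Paired = @"<h([0-9])>.*</h\1>";

    // Returns true if the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

With this sketch, Loose accepts `<h2>sample</h1>` while Paired rejects it; both accept `<h2>sample</h2>`.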

You can use XmlDocument (HTML is not XML, but it is good enough for your requirement), or you can use the Html Agility Pack.
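A rough sketch of the XmlDocument route, assuming the input is well-formed as XML (every tag closed, no undefined entities such as `&nbsp;`). The class and method names are hypothetical, and the `<root>` wrapper is only there because an XML document needs a single root element. LoadXml throws an XmlException on markup that is not well-formed, which is where a real HTML parser such as the Html Agility Pack would be the better choice.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Hypothetical helper, not from the original thread.
public static class HeadingRemover
{
    // Removes every h1..h6 element (and its content) from an XML-well-formed fragment.
    public static string RemoveHeadings(string fragment)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;                  // keep whitespace-only text nodes
        doc.LoadXml("<root>" + fragment + "</root>");   // wrap so there is a single root

        // Copy the result to a list first so removal does not disturb the enumeration.
        List<XmlNode> headings = doc
            .SelectNodes("//h1 | //h2 | //h3 | //h4 | //h5 | //h6")
            .Cast<XmlNode>()
            .ToList();

        foreach (XmlNode heading in headings)
        {
            heading.ParentNode.RemoveChild(heading);
        }

        return doc.DocumentElement.InnerXml;
    }
}
```

Because whole elements are removed from the tree, a heading nested inside another heading goes away together with its parent, which the regex approach cannot guarantee.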

score 2 · Accepted Answer

Try this code:

String sourcestring = "<h2>content needs removing</h2> other content...";
String matchpattern = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";
String replacementpattern = @"";
MessageBox.Show(Regex.Replace(sourcestring,matchpattern,replacementpattern));

[^<]+ is safer than .+ because it stops consuming characters as soon as it sees a <.
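The difference can be seen by counting matches on an input with two headings. This is only a sketch; NegatedClassDemo and CountMatches are hypothetical names, and GreedyDot is the same pattern with .+ swapped in for comparison.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original thread.
public static class NegatedClassDemo
{
    // Pattern from the answer: the heading body may not contain '<'.
    public const string Negated = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";

    // Comparison pattern: a greedy dot that can run across other tags.
    public const string GreedyDot = @"\s?<h[0-9]>.+</h[0-9]>\s?";

    // Counts the non-overlapping matches of the pattern in the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }
}
```

On `<h2>a</h2> keep <h3>b</h3> tail`, Negated finds the two headings separately, while GreedyDot finds a single match that swallows the text between them. The trade-off is that [^<]+ does not match a heading that contains inline markup, such as `<h2><b>x</b></h2>`.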

score 1 · Accepted Answer

This works fine for me:

string myString = "<h2>content needs removing</h2> other content...";
Console.WriteLine(myString);
myString = Regex.Replace(myString, "<h[0-9]>.*</h[0-9]>", string.Empty);
Console.WriteLine(myString);

It displays:

<h2>content needs removing</h2> other content...
other content...

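As expected.

If your problem is that your real input contains several different heading tags, then you are running into the greedy * quantifier, which produces the longest possible match. For example, if you have: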

<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff

you will match everything from <h2> to </h3> and replace it. To fix this, you need to use a lazy quantifier:

myString = Regex.Replace(myString, "<h[0-9]>.*?</h[0-9]>", string.Empty);

which leaves you with:

other content... and some other stuff

Note, however, that this will not fix nested <h> tags. As @fardjad said, using regex on HTML is generally not a good idea.
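To illustrate the nesting caveat, here is a small sketch that wraps the lazy pattern in a helper (the names NestedHeadingDemo and StripLazy are hypothetical) so it can be tried on both flat and nested input.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original thread.
public static class NestedHeadingDemo
{
    // Applies the lazy pattern from this answer and removes every match.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, "<h[0-9]>.*?</h[0-9]>", string.Empty);
    }
}
```

On the flat two-heading example above this removes both headings. On nested input such as `<h2>outer <h3>inner</h3> tail</h2> rest`, the lazy match stops at the first closing tag it meets (`</h3>`), so ` tail</h2> rest` is left behind with a dangling closing tag.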
