c# - How to "output" text with a regular expression

Question

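I have a small problem. I'm trying to make the HTML elements disappear from some text, so that only the text outside them is left. Example input: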

I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>

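Does anyone know how this could be done with a regular expression? I think I could do it by removing the element text. Or does anyone know another solution to this problem? Please help me.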

score 3 · Accepted Answer

Regular expressions are poorly suited to parsing general HTML, especially malformed HTML. Use an HTML parser such as the HTML Agility Pack instead.

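What is the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).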

score 1 · Accepted Answer

Try this:

(?<!<.*?)([^<>]+)

Explanation

@"
(?<!        # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
   <           # Match the character “<” literally
   .           # Match any single character that is not a line break character
      *?          # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
(           # Match the regular expression below and capture its match into backreference number 1
   [^<>]       # Match a single character NOT present in the list “<>”
      +           # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
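To see what this pattern captures, here is a minimal sketch that runs it over input shaped like the question's example. The class and method names (`OutsideTagText`, `Extract`) are made up for illustration. It relies on .NET allowing variable-length lookbehind, and on `.` not matching a newline by default, so the check for an earlier `<` restarts on each line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class OutsideTagText
{
    // Pattern from the answer: text that has no '<' earlier on the same line.
    private static readonly Regex Pattern = new Regex(@"(?<!<.*?)([^<>]+)");

    // Returns every captured run of text, in order of appearance.
    public static List<string> Extract(string input)
    {
        var pieces = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            pieces.Add(m.Groups[1].Value);
        }
        return pieces;
    }

    public static void Main()
    {
        string input = "I want this text <I don't want this text/>\n" +
                       "I want this text <I don't>want this</text>";

        // Prints one line per captured piece; Trim() drops the space before each tag.
        foreach (string piece in Extract(input))
        {
            Console.WriteLine(piece.Trim());
        }
    }
}
```

Note that the pattern only keeps text that comes before the first tag on a line. Anything after a tag on the same line is skipped, even if it is text you wanted, so this fits the example input but probably not arbitrary HTML.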

score 1 · Accepted Answer

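I agree that an HTML parser should be used for anything non-trivial (the Agility Pack is very good if you're on .NET), but for a small requirement like this it is most likely overkill. Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test a regular expression before relying on it.

Here you go: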

<.*?>.*?<.*?>|<.*?/>

It also correctly drops the whole of

<I don't>want this</text>

and not just the tags.

In C#, this becomes:

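// Requires: using System.Text.RegularExpressions;
// subjectString holds the input text; every match is replaced with the empty string.
string resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");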
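As a self-contained sketch, the same call can be wrapped in a small program and run against the two example lines from the question. The names `TagStripper` and `Strip` are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the one-liner above.
public static class TagStripper
{
    // Removes either an open/close tag pair together with the text between
    // them, or a single self-closing tag. '.' does not cross line breaks here.
    public static string Strip(string subject)
    {
        return Regex.Replace(subject, "<.*?>.*?<.*?>|<.*?/>", "");
    }

    public static void Main()
    {
        // Self-closing element: only the leading text (with its trailing space) remains.
        Console.WriteLine(Strip("I want this text <I don't want this text/>"));

        // Open/close pair: the tags and the text between them are removed.
        Console.WriteLine(Strip("I want this text I wan this text <I don't>want this</text>"));
    }
}
```

One thing to test for: the first alternative pairs any two tags on the same line, so in `a <x/> b <y/>` it treats `<x/>` and `<y/>` as a pair and removes the wanted ` b ` in between, leaving `a `. If the input can contain several self-closing tags per line, the pattern needs adjusting, or a parser is the safer choice.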
