c# - Regex to find missing spaces after HTML tags

Question

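In more than 10,000 lines of text, I need to find every string that is missing a space after one of a set of HTML tags. The set of tags is limited; they are listed below.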

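<b> </b>, <em> </em>, <span style="text-decoration: underline;" data-mce-style="text-decoration: underline;"> </span>, <sub> </sub>, <sup> </sup>, <ul> </ul>, <li> </li>, <ol> </ol>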

After running the regex, strings like the following should be reported.

Hi <b>all</b>good morning.

In this case, the space after the closing bold tag is missing.

score 3 · Accepted Answer

Assuming C#:

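using System.Collections.Specialized;   // StringCollection
using System.Text.RegularExpressions;   // Regex, Match, RegexOptions

// subjectString holds the text to search.
StringCollection resultList = new StringCollection();
// RegexOptions.Multiline makes ^ and $ match at the start and end of every line.
Regex regexObj = new Regex("^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"|/span)>(?! ).*$", RegexOptions.Multiline);
Match matchResult = regexObj.Match(subjectString);
// Each match is one whole line that contains an offending tag.
while (matchResult.Success) {
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
}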

This returns every line in the file in which at least one of the listed tags is not followed by a space.

Input:

This </b> is <b> OK
This <b> is </b>not OK
Neither <b>is </b> this.

Output:

This <b> is </b>not OK
Neither <b>is </b> this.
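
To reproduce the input and output above, the pattern can be wrapped in a small console program. This is only a sketch: the class name MissingSpaceFinder and the helper FindLines are hypothetical names chosen for illustration, and the sample text is assumed to use \n line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MissingSpaceFinder
{
    // Same pattern as in the answer, written as a verbatim string
    // (inside @"..." a doubled quote "" stands for one literal quote).
    private static readonly Regex TagWithoutSpace = new Regex(
        @"^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=""text-decoration: underline;"" data-mce-style=""text-decoration: underline;""|/span)>(?! ).*$",
        RegexOptions.Multiline);

    // Hypothetical helper: returns every line that contains a listed tag
    // not immediately followed by a space.
    public static List<string> FindLines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in TagWithoutSpace.Matches(text))
        {
            lines.Add(m.Value);
        }
        return lines;
    }

    public static void Main()
    {
        string input =
            "This </b> is <b> OK\n" +
            "This <b> is </b>not OK\n" +
            "Neither <b>is </b> this.";

        foreach (string line in FindLines(input))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that a tag at the very end of a line is also reported, because nothing (and therefore no space) follows it. If the text uses \r\n line endings, each returned line keeps its trailing \r, so trimming the results may be needed.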

Explanation:

^      # Start of line
.*     # Match any number of characters except newlines
<      # Match a <
(?:    # Either match a...
 /?b   #  b or /b
|      # or 
 /?em  #  em or /em
|...   # etc. etc.
)      # End of alternation
>      # Match a >
(?! )  # Assert that no space follows
.*     # Match any number of characters until...
$      # End of line
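
The commented layout above can also be kept in the source code by compiling the pattern with RegexOptions.IgnorePatternWhitespace. The following is a sketch of that idea, not part of the original answer; the class name CommentedPattern is hypothetical. One caveat: in this mode unescaped whitespace in the pattern is ignored, so every literal space (including the one in the lookahead) is written as [ ].

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPattern
{
    // Free-spacing version of the same pattern. Whitespace outside character
    // classes is ignored and # starts a comment that runs to the end of the line.
    public static readonly Regex TagWithoutSpace = new Regex(@"
        ^                 # Start of line
        .*                # Any number of characters except newlines
        <                 # A literal <
        (?:               # One of the listed tags:
            /?b           #   b or /b
          | /?em          #   em or /em
          | /?su[pb]      #   sup, sub, /sup or /sub
          | /?[ou]l       #   ol, ul, /ol or /ul
          | /?li          #   li or /li
          | span[ ]style=""text-decoration:[ ]underline;""[ ]data-mce-style=""text-decoration:[ ]underline;""
          | /span         #   closing span
        )
        >                 # A literal >
        (?![ ])           # Assert that no space follows
        .*                # Rest of the line
        $                 # End of line",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Console.WriteLine(TagWithoutSpace.IsMatch("This <b> is </b>not OK")); // True
        Console.WriteLine(TagWithoutSpace.IsMatch("This </b> is <b> OK"));    // False
    }
}
```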
