c# - Remove HTML tags from a string, including &nbsp;, in C#

Question

How can I remove all HTML tags, including &nbsp;, using a regular expression in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

score 202 · Accepted Answer

If you can't use an HTML parser based solution to filter out the tags, here is a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex that collapses runs of whitespace:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

score 32 · Accepted Answer

I took @Ravi Thapliyal's code and turned it into a method. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
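
To make the behavior of this method concrete, here is a small sketch that repeats the same two steps inside a hypothetical `HtmlScrubber` class. The class name and the sample inputs in the note below are my own assumptions, not part of the answer.

```csharp
using System.Text.RegularExpressions;

public static class HtmlScrubber
{
    // Same two steps as ScrubHtml above: remove tags and &nbsp;, then collapse whitespace.
    public static string ScrubHtml(string value)
    {
        // Step 1: delete anything shaped like <...> and every literal "&nbsp;".
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        // Step 2: replace any run of two or more whitespace characters with one space.
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }
}
```

With this sketch, `HtmlScrubber.ScrubHtml("<div>hello</div><div><br></div><div>&nbsp; &nbsp;</div>")` returns `hello`. One thing to be aware of: because tags are replaced with an empty string, `HtmlScrubber.ScrubHtml("<p>a</p><p>b</p>")` returns `ab`, so text from adjacent elements runs together.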

score 17 · Accepted Answer

I have been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        //add characters that should not be removed to this regex
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }
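
The `RemoveTag` loop above deletes comments and whole `<script>`/`<style>` blocks before the generic tag regex runs, so their contents do not survive as text. As an alternative sketch only (not the answer's method), the same step could probably be done with one regex. `BlockRemover` and `RemoveBlocks` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class BlockRemover
{
    // Matches HTML comments and complete <script>/<style> elements, contents included.
    // Singleline lets '.' cross line breaks; IgnoreCase also covers <SCRIPT>, <Style>, etc.
    private static readonly Regex Blocks = new Regex(
        @"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with every matched block deleted; unterminated blocks are left alone.
    public static string RemoveBlocks(string html)
    {
        return Blocks.Replace(html, "");
    }
}
```

For example, `BlockRemover.RemoveBlocks("<p>hi</p><script>var a = 1 < 2;</script><!-- note -->")` returns `<p>hi</p>`, which can then go through the tag-stripping step.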

score 4 · Accepted Answer

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
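
Listing entities one by one in the pattern only covers the entities you thought of. A possible alternative, sketched here under the assumption that `System.Net.WebUtility` is available in your target framework, is to strip the tags first and then decode whatever entities remain. `HtmlText` and `StripAndDecode` are hypothetical names, and unlike the one-liner above this sketch replaces tags with a space rather than an empty string.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string StripAndDecode(string html)
    {
        // 1. Replace anything shaped like a tag with a space so adjacent words stay separated.
        var noTags = Regex.Replace(html, @"<[^>]+>", " ");
        // 2. Decode character references such as &amp;, &nbsp; or &raquo; into real characters.
        var decoded = WebUtility.HtmlDecode(noTags);
        // 3. Turn non-breaking spaces (U+00A0) into ordinary spaces.
        var spaced = decoded.Replace('\u00A0', ' ');
        // 4. Collapse every whitespace run to a single space and trim the ends.
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }
}
```

Because decoding happens after the tags are removed, an escaped `&lt;b&gt;` ends up in the result as the literal text `<b>` instead of being treated as markup.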

score 2 · Accepted Answer

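I used the code from @RaviThapliyal and @Don Rolling with a small modification. Their version replaces &nbsp; with an empty string, but it should be replaced with a space, so I added an extra step. It works like a charm for me.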

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    return step3;
}

I wrote &nbsp without the semicolon because Stack Overflow was formatting it.

score 0 · Accepted Answer

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

Then x = hello
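
One difference between this pattern and the `<[^>]+>` used in other answers: in .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set, so a tag that spans two lines is missed by `<.+?>`. The sketch below puts the two patterns side by side; `TagPatterns` and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class TagPatterns
{
    // Pattern from this answer: '.' stops at '\n' by default.
    public static string StripWithDot(string html)
    {
        return Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();
    }

    // Pattern from the other answers: [^>] matches any character except '>', newlines included.
    public static string StripWithNegatedClass(string html)
    {
        return Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
    }
}
```

For the input `"<div\nclass=\"x\">hello</div>"`, `StripWithDot` leaves the opening tag in place and only removes `</div>`, while `StripWithNegatedClass` returns `hello`. Passing `RegexOptions.Singleline` to the first call should make the two behave the same on this input.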

score 0 · Accepted Answer

Cleaning up an HTML document involves a lot of tricky things. This package may help: https://github.com/mganss/HtmlSanitizer

score 0 · Accepted Answer

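HTML is, in its basic form, just XML. You could parse your text into an XmlDocument object and call InnerText on the root element to extract the text. This strips all HTML tags in any form and also deals with special characters such as &lt; in one go.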
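
A minimal sketch of this idea follows, with one important assumption spelled out: it only works when the input is well-formed XML. `XmlDocument.LoadXml` throws an `XmlException` on an unclosed `<br>` or on an entity that XML does not predefine, such as `&nbsp;`, so the string from the question would be rejected as written. `XmlTextExtractor`, `TryGetInnerText`, and the `<root>` wrapper are hypothetical.

```csharp
using System.Xml;

public static class XmlTextExtractor
{
    // Returns the concatenated text content, or null when the markup is not well-formed XML.
    public static string TryGetInnerText(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            // Wrap the fragment in a single root element so several top-level elements can load.
            doc.LoadXml("<root>" + markup + "</root>");
        }
        catch (XmlException)
        {
            // Typical causes: unclosed tags like <br>, or HTML-only entities like &nbsp;.
            return null;
        }
        // InnerText drops all element tags and resolves predefined entities such as &lt;.
        return doc.DocumentElement.InnerText;
    }
}
```

`XmlTextExtractor.TryGetInnerText("<div>a &lt; b</div><div><br/></div>")` returns `a < b`, while both `"<div>hello</div><div><br></div>"` and `"<div>&nbsp;</div>"` yield `null`.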

score -1 · Accepted Answer

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

Answered 2017-02-10T17:58:20.397
