c# - HTML attribute stripper

Question

I want to strip all attributes (and their values) from an HTML string using C# and RegEx.

For example:

<p>This is a text</p><span class="cls" style="background-color: yellow">This is another text</span>

should become

<p>This is a text</p><span>This is another text</span>

In addition, all attributes must be removed whether or not their values are enclosed in quotes.

I.e.

<p class="cls">Some content</p>
<p class='cls'>Some content</p>
<p class=cls>Some content</p>

should all result in

<p>Some content</p>

For security reasons I cannot use HTMLAgilityPack, so I need to do this with RegEx.

score 0 · Accepted Answer

I have a solution that does not use regular expressions. It combines Substring() and IndexOf(). Error handling is minimal; this is just an idea.

Working demo

C#:

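using System;

internal static class Program
{
    private static void Main(string[] args)
    {
        string s = @"<p>This is a text</p><span class=""cls"" style=""background-color: yellow"">This is another text</span>";

        // Split on '<' so every piece starts with a tag name, then restore the '<'.
        // Assumes the input starts with a tag (no text before the first '<').
        var list = s.Split(new[] {"<"}, StringSplitOptions.RemoveEmptyEntries);
        foreach (var item in list)
            Console.Write(ClearAttributes('<' + item));
        Console.ReadLine();
    }

    // Keeps only the tag name of the first tag in source; text after '>' is left as-is.
    private static string ClearAttributes(string source)
    {
        int startindex = source.IndexOf('<');
        int endindex = source.IndexOf('>');
        if (startindex < 0 || endindex < startindex)
            return source; // no complete tag in this piece, nothing to strip
        string tag = source.Substring(startindex + 1, endindex - startindex - 1);
        // Everything after the first whitespace character is attributes.
        int spaceindex = tag.IndexOfAny(new[] {' ', '\t', '\r', '\n'});
        if (spaceindex > 0)
            tag = tag.Substring(0, spaceindex);
        return String.Concat('<', tag, source.Substring(endindex));
    }
}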

Output:

<p>This is a text</p><span>This is another text</span>
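Since the question asks specifically for RegEx, here is a rough sketch of how a regex-based variant could look. The class name AttributeStripper and the method Strip are made up for illustration, and the pattern is only an approximation: it assumes reasonably well-formed markup and will misfire on things like a bare `<` inside script text. It matches an opening tag, skips over quoted or unquoted attribute text, and rewrites the tag with just its name (keeping a trailing `/` on self-closing tags).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class AttributeStripper
{
    // <name            tag name (letters, digits, ':' and '-')
    // (?=[\s/>])       name must be followed by whitespace, '/' or '>'
    // (?: ... )*?      attribute text: "quoted", 'quoted', or any char except quotes and '>'
    // \s*(/?)>         optional self-closing slash, then the end of the tag
    private static readonly Regex OpeningTag = new Regex(
        @"<([A-Za-z][A-Za-z0-9:-]*)(?=[\s/>])(?:""[^""]*""|'[^']*'|[^'"">])*?\s*(/?)>",
        RegexOptions.Compiled);

    // Returns html with every opening tag reduced to <name> or <name/>.
    // Closing tags, comments and text are left untouched.
    public static string Strip(string html)
    {
        return OpeningTag.Replace(html, "<$1$2>");
    }
}
```

Calling AttributeStripper.Strip on each of the three `<p class=...>` variants from the question should give `<p>Some content</p>`. Unlike the IndexOf version, a `>` inside a quoted attribute value does not end the tag early, because quoted values are consumed as a unit.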
