c# - Sanitizing HTML-encoded text from AntiXSS v3 output (#decimal notation)

Question

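I want to make the comments in my blog engine XSS-safe. I have tried many different approaches, but found it very difficult.

When I display a comment, I first HTML-encode the whole thing with Microsoft AntiXss 3.0. Then I try to HTML-decode the safe tags using a whitelist approach.

I have been looking at Steve Downing's example in Atwood's "Sanitize HTML" thread on refactormycode.

My problem is that the AntiXss library encodes values to &#DECIMAL; notation, and I don't know how to rewrite Steve's example because my regex knowledge is limited.

I tried the following code, where I simply replaced the entities with their decimal form, but it does not work properly.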

&lt; with &#60;
&gt; with &#62;

My rewrite:

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like a HTML tag after HtmlEncoding.  Splits the input so we can get discrete
    /// chunks that start with &lt; and ends with either end of line or &gt;
    /// </summary>
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static Regex _whitelist = new Regex(@"
^&#60;/?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$
|^&#60;(b|h)r\s?/?&#62;$
|^&#60;a(?!&#62;).+?&#62;$
|^&#60;img(?!&#62;).+?/?&#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags HTML-encoded
    /// </summary>
    public static string Sanitize(string html)
    {

        string tagname = "";
        Match tag;
        MatchCollection tags = _tags.Matches(html);
        string safeHtml = "";

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }

        }

        return html;
    }

}

My test input HTML is:

<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>

After AntiXss it becomes:

&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;

When I run the version of Sanitize(string html) above, it gives me:

<p><script language="javascript">alert&#40;&#39;XSS&#39;&#41;</script><b>bold should work</b></p>

The whitelist regex matches the script tag, which I do not want. Any help with this would be greatly appreciated.

score 1 · Accepted Answer

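Have you considered using Markdown or VBCode or some similar approach to let users mark up their comments? Then you could disallow all HTML.

If you must allow HTML, then I would consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.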

score 1 · Accepted Answer

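Yes, I am using the WMD editor with Markdown, but I want users to be able to post HTML and code samples the way they can on Stack Overflow, so I do not want to disallow HTML completely.

I have been looking at HTML Tidy, but have not tried it yet. I do, however, use Html Agility Pack to make sure the HTML is well-formed (no orphaned tags). That is done before I run AntiXss.

If I cannot get my current solution to work the way I like, I will try HTML Tidy. Thanks for the suggestion.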

score 1 · Accepted Answer

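Your problem is that C# does not interpret your regex the way you intended. You need to escape the # sign: with RegexOptions.IgnorePatternWhitespace, an unescaped # starts a comment that runs to the end of the line, so without the escape the pattern matches far too much.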

private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
 );
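
To see why the escape matters, here is a minimal sketch that can be run on its own. The class name HashEscapeDemo, the two-tag pattern, and the sample input are made up for illustration and are not part of the original code; the only thing it relies on is that .NET treats an unescaped # as the start of a comment when RegexOptions.IgnorePatternWhitespace is set.

```csharp
using System;
using System.Text.RegularExpressions;

static class HashEscapeDemo
{
    // Hypothetical sample: a <script> tag after decimal entity encoding.
    public const string EncodedScript = "&#60;script&#62;";

    // Unescaped '#': under IgnorePatternWhitespace the rest of the line is a
    // comment, so the pattern that actually runs is just "^&".
    public static readonly Regex Unescaped = new Regex(
        @"^&#60;(b|i)&#62;$",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    // Escaped '\#': the whole pattern is kept, so only <b> and <i> match.
    public static readonly Regex Escaped = new Regex(
        @"^&\#60;(b|i)&\#62;$",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    public static void Main()
    {
        Console.WriteLine(Unescaped.IsMatch(EncodedScript)); // True: matches anything starting with '&'
        Console.WriteLine(Escaped.IsMatch(EncodedScript));   // False: script is not in the pattern
        Console.WriteLine(Escaped.IsMatch("&#60;b&#62;"));   // True: an encoded <b> tag
    }
}
```

This is the same effect as in the question: every alternative in the unescaped whitelist collapses to "^&", so the encoded script tag passes the check.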

Update 2: You might be interested in this xss and regexp site.

score 0 · Accepted Answer

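I am on a Mac, so I cannot test your C# code. But it seems to me that you should make the _whitelist regex apply to the tag name only. That might mean you have to make two passes, one for opening tags and one for closing tags. But it would make things much simpler.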
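
A rough sketch of that two-pass idea, under some assumptions of mine: the class name TagNameWhitelist is hypothetical, the allowed names are taken from the question's regex without a and img, and only attribute-free tags are decoded. It has not been tested against real XSS payloads.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagNameWhitelist
{
    // Assumed whitelist: tag names from the question's regex, minus a and img.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "b", "blockquote", "code", "em", "h1", "h2", "h3", "i", "li",
            "ol", "p", "pre", "s", "sub", "sup", "strong", "strike", "ul"
        };

    // '#' is escaped so the patterns stay correct even if
    // RegexOptions.IgnorePatternWhitespace is added later.

    // Pass 1: encoded opening tags without attributes, e.g. &#60;b&#62;
    private static readonly Regex Opening = new Regex(
        @"&\#60;(?<name>[a-z][a-z0-9]*)&\#62;", RegexOptions.IgnoreCase);

    // Pass 2: encoded closing tags, e.g. &#60;&#47;b&#62;
    private static readonly Regex Closing = new Regex(
        @"&\#60;&\#47;(?<name>[a-z][a-z0-9]*)&\#62;", RegexOptions.IgnoreCase);

    // Decodes only tags whose name is on the whitelist; everything else,
    // including any tag that carries attributes, stays encoded.
    public static string Sanitize(string encoded)
    {
        string result = Opening.Replace(encoded, m =>
            Allowed.Contains(m.Groups["name"].Value)
                ? "<" + m.Groups["name"].Value + ">"
                : m.Value);

        return Closing.Replace(result, m =>
            Allowed.Contains(m.Groups["name"].Value)
                ? "</" + m.Groups["name"].Value + ">"
                : m.Value);
    }
}
```

Because the tag is rebuilt from the captured name instead of being passed through HtmlDecode, nothing other than the name can end up in the decoded output. Called on the encoded test input from the question, TagNameWhitelist.Sanitize decodes the p and b tags and leaves both script tags encoded.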

score 0 · Accepted Answer

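In case anyone is interested in using it, I am posting the complete code here again (slightly refactored and with updated comments).

I also decided to remove the img tag from the whitelist, since @Pez and @some pointed out that allowing it could be dangerous.

It must also be said that I have not tested this properly against possible XSS attacks. It is only meant as an illustration of how well this approach works.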

using System.Text.RegularExpressions;
using System.Web;

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HtmlEncoding to &#DECIMAL; notation. Microsoft AntiXSS 3.0 can be used to perform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either end of line or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>

    private static readonly Regex _whitelist = new Regex(@"
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
|^&\#60;a(?!&\#62;).+?&\#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags HTML-encoded
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }
        return html;
    }
}
