javascript - Sanitize/rewrite HTML on the client side

Question

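I need to display external resources loaded via cross-domain requests and make sure that only "safe" content is displayed.

I could use Prototype's String#stripScripts to remove script blocks. But handlers such as onclick or onerror are still there.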
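The gap described above can be reproduced with a minimal sketch. `stripScriptBlocks` is a hypothetical stand-in that only approximates a regex-based script stripper; it is an assumption for illustration, not Prototype's actual implementation.

```javascript
// Hypothetical stand-in for a regex-based script stripper (an assumption,
// not Prototype's real code): it removes <script>...</script> blocks only.
function stripScriptBlocks(html) {
  return html.replace(/<script[^>]*>[\s\S]*?<\/script\s*>/gi, '');
}

var dirty = '<p>hi</p><script>alert(1)</script><img src=x onerror=alert(2)>';
var stripped = stripScriptBlocks(dirty);

// The script block is gone, but the inline onerror handler is untouched.
console.log(stripped);
```

The remaining `onerror` attribute is exactly the kind of handler the question wants removed.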

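Is there any library which can at least

strip script blocks,
kill DOM handlers,
remove blacklisted tags (e.g. embed or object)?

So are there any JavaScript-related links and examples out there?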

score 113 · Accepted Answer

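Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.

Shameless plug: see caja/plugin/html-sanitizer.js for a client-side HTML sanitizer that has been thoroughly reviewed.

It is whitelist-based, not blacklist-based, but the whitelists are configurable as per CajaWhitelists.

If you want to remove all tags, then do the following: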

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '&lt;');
}

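People will tell you that you can create an element, assign innerHTML, then read innerText or textContent, and then escape the entities in that. Do not do that. It is vulnerable to XSS injection, since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.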
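When the goal is only to display untrusted input as text, one alternative to the innerHTML trick is to escape by plain string replacement, so the browser never parses the input at all. The following is a minimal sketch; `escapeHtml` is a hypothetical helper name and is not part of Caja.

```javascript
// Hypothetical helper: escapes the characters that are significant in HTML
// text and attribute values. No DOM is involved, so no handler can fire.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

var shown = escapeHtml('<img src=bogus onerror=alert(1337)>');
console.log(shown);
```

This renders the markup as visible text rather than removing it, so it complements a tag remover instead of replacing one.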

score 40 · Accepted Answer

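The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, and processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

Check out the source, then build by running ant
Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
Call html_sanitize(...)

The worker script only needs to follow those instructions: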

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)
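The page side of this exchange is not shown in the answer. A minimal sketch of what it might look like follows; `sanitizeInWorker` and the file name `sanitizer-worker.js` are assumptions for illustration. The worker is passed in as a parameter so the helper works with either a real Worker or a simulated one.

```javascript
// Hypothetical helper: sends one HTML string to the worker and hands the
// sanitized result (or an error) to a Node-style callback.
function sanitizeInWorker(worker, html, callback) {
  worker.onmessage = function (event) {
    callback(null, event.data);
  };
  worker.onerror = function (error) {
    callback(error);
  };
  worker.postMessage(html);
}

// In a page (file name is an assumption):
//   var worker = new Worker('sanitizer-worker.js');
//   sanitizeInWorker(worker, untrustedHtml, function (err, clean) {
//     if (!err) { /* insert `clean` into the page */ }
//   });
```

Because the sketch assigns a single onmessage handler, it supports only one request in flight at a time; overlapping requests would need some form of request id.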

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html

score 21 · Accepted Answer

Never trust the client. If you're writing a server application, assume that the client will always submit unsanitary, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitation in server code, which you know (to a reasonable degree) won't be fiddled with. Perhaps you could use a serverside web application as a proxy for your clientside code, which fetches from the 3rd party and does sanitation before sending it to the client itself?

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.

score 17 · Accepted Answer

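Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.

NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing. (Note: the table seems to say it won't work in Opera Mini, but I just tried it, and it worked.)

The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let the browser parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.

The whitelists shown here are just examples. What's best to whitelist depends on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, this method can accommodate that, though this example code does not.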

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;
  
  function makeSanitizedCopy(node) {
    var newNode;
    if (node.nodeType == Node.TEXT_NODE) {
      newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy);
      }
    } else {
      // Anything not whitelisted (including its subtree) becomes an empty fragment.
      newNode = document.createDocumentFragment();
    }
    return newNode;
  }

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
};

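Security hole: commenter @Explosion pointed out that the href attribute can contain JavaScript, e.g. <a href="javascript:alert('Oops')">. It should be possible to catch this in the sanitizing code and remove it, but the above code has not (yet) been updated to do that.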
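One possible way to close this hole is to check the scheme of every URL-valued attribute against an allowlist before copying it. The sketch below is not part of the original answer: `isSafeUrl`, the list of allowed protocols, and the placeholder base URL are all assumptions. It relies on the standard `URL` constructor, which normalizes case and strips embedded tabs, newlines, and leading whitespace before the scheme is read.

```javascript
// Assumption: these are the only schemes the application wants to allow.
var SAFE_PROTOCOLS = ['http:', 'https:', 'mailto:'];

// Hypothetical helper: returns true only if the value parses as a URL whose
// scheme is on the allowlist. Relative URLs resolve against the base.
function isSafeUrl(value) {
  try {
    // example.invalid is a placeholder base used only to resolve relative URLs.
    var url = new URL(value, 'https://example.invalid/');
    return SAFE_PROTOCOLS.indexOf(url.protocol) !== -1;
  } catch (e) {
    return false; // unparseable values are rejected
  }
}

console.log(isSafeUrl("javascript:alert('Oops')"));
console.log(isSafeUrl('https://example.com/page'));
```

In the sanitizer above, such a check could be applied to `attr.value` for `href` and `src` before calling `setAttribute`.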

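You can try it out here.

Note that I'm disallowing style attributes and tags in this example. If you allowed them, you would probably want to parse the CSS and make sure it's safe for your purposes.

I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.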

score 13 · Accepted Answer

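So, it's 2016, and I think many of us are using npm modules in our code now. sanitize-html seems like the leading option on npm, though there are others.

Other answers to this question provide great input on how to roll your own, but this is a tricky enough problem that well-tested community solutions are probably the best answer.

Run this on the command line to install:

npm install --save sanitize-html

ES5:

var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);

ES6:

import sanitizeHtml from 'sanitize-html';
// ...
let sanitized = sanitizeHtml(htmlInput);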

score 12 · Accepted Answer

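You can't anticipate every possible weird type of malformed markup that some browser somewhere might trip over to escape a blacklist, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.

Instead, attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).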
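The approach just described can be sketched on an already-parsed tree. Everything here is an assumption for illustration: the node shape `{ tag, attrs, children }` with lowercase tag names, the `filterNode` name, and the contents of the allowlists. The parsing step itself is out of scope for this sketch.

```javascript
// Assumed allowlists; a real application would choose its own minimal sets.
var ALLOWED_TAGS = new Set(['p', 'b', 'em', 'a']);
var ALLOWED_ATTRS = new Set(['href', 'title']);
var URL_ATTRS = new Set(['href']);
// Allowlist of schemes: anything else, including relative URLs, is dropped.
var ALLOWED_URL = /^(?:https?|mailto):/i;

// Hypothetical filter over nodes shaped like
// { tag: 'a', attrs: { href: '...' }, children: [node | string] }.
function filterNode(node) {
  if (typeof node === 'string') { return node; }      // text passes through
  if (!ALLOWED_TAGS.has(node.tag)) { return null; }   // drop the whole subtree
  var attrs = {};
  Object.keys(node.attrs || {}).forEach(function (name) {
    var value = node.attrs[name];
    if (!ALLOWED_ATTRS.has(name)) { return; }
    if (URL_ATTRS.has(name) && !ALLOWED_URL.test(value)) { return; }
    attrs[name] = value;
  });
  var children = (node.children || [])
    .map(filterNode)
    .filter(function (child) { return child !== null; });
  return { tag: node.tag, attrs: attrs, children: children };
}
```

Using `Set` rather than plain-object lookups avoids accidental matches on inherited property names such as `constructor`.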

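If the input is well-formed XHTML, the first part of the above is much easier.

As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?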

score 4 · Accepted Answer

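[Disclaimer: I'm one of the authors]

We wrote a "web-only" (i.e. "requires a browser") open source library for this, https://github.com/jitbit/HtmlSanitizer, that removes all tags/attributes/styles except the "whitelisted" ones.

Usage: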

var input = HtmlSanitizer.SanitizeHtml("<script> Alert('xss!'); </scr"+"ipt>");

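P.S. It works much faster than a "pure JavaScript" solution, since it uses the browser to parse and manipulate the DOM. If you're interested in a "pure JS" solution, please try https://github.com/punkave/sanitize-html (not affiliated).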

score 2 · Accepted Answer

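The Google Caja library suggested above was way too complex to configure and include in my Web application project (so, running in the browser). What I resorted to instead, since we already use the CKEditor component, is to use its built-in HTML sanitizing and whitelisting function, which is far easier to configure. So, you can load a CKEditor instance in a hidden iframe and do something like: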

CKEDITOR.instances['myCKEInstance'].dataProcessor.toHtml(myHTMLstring)

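Now, granted, if you're not using CKEditor in your project this may be a bit of overkill, since the component itself is around half a megabyte (minified), but if you have the sources, maybe you can isolate the code doing the whitelisting (CKEDITOR.htmlParser?) and make it much shorter.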

http://docs.ckeditor.com/#!/api

http://docs.ckeditor.com/#!/api/CKEDITOR.htmlDataProcessor

score 1 · Accepted Answer

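Instead of using regular expressions, I thought of a way to do this with the native DOM. This way you can parse the HTML into a document, query every element in it, and remove any elements and attributes that are not whitelisted. The attribute list can be a simple array of attribute-name strings to allow, or entries can use regular expressions to validate values and restrict an attribute to certain tags.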

const sanitize = (html, tags = undefined, attributes = undefined) => {
    var attributes = attributes || [
      { attribute: "src", tags: "*", regex: /^(?:https|http|\/\/):/ },
      // Denylist check; case-insensitive and whitespace-tolerant so "JavaScript:" is also rejected.
      { attribute: "href", tags: "*", regex: /^(?!\s*javascript:).+/i },
      { attribute: "width", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "height", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "id", tags: "*", regex: /^[a-zA-Z]+$/ },
      { attribute: "class", tags: "*", regex: /^[a-zA-Z ]+$/ },
      { attribute: "value", tags: ["INPUT", "TEXTAREA"], regex: /^.+$/ },
      { attribute: "checked", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      {
        attribute: "placeholder",
        tags: ["INPUT", "TEXTAREA"],
        regex: /^.+$/,
      },
      {
        attribute: "alt",
        tags: ["IMG", "AREA", "INPUT"],
        //"^" and "$" match beginning and end
        regex: /^[0-9a-zA-Z]+$/,
      },
      { attribute: "autofocus", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      { attribute: "for", tags: ["LABEL", "OUTPUT"], regex: /^[a-zA-Z0-9]+$/ },
    ]
    var tags = tags || [
      "I",
      "P",
      "B",
      "BODY",
      "HTML",
      "DEL",
      "INS",
      "STRONG",
      "SMALL",
      "A",
      "IMG",
      "CITE",
      "FIGCAPTION",
      "ASIDE",
      "ARTICLE",
      "SUMMARY",
      "DETAILS",
      "NAV",
      "TD",
      "TH",
      "TABLE",
      "THEAD",
      "TBODY",
      "NAV",
      "SPAN",
      "BR",
      "CODE",
      "PRE",
      "BLOCKQUOTE",
      "EM",
      "HR",
      "H1",
      "H2",
      "H3",
      "H4",
      "H5",
      "H6",
      "DIV",
      "MAIN",
      "HEADER",
      "FOOTER",
      "SELECT",
      "COL",
      "AREA",
      "ADDRESS",
      "ABBR",
      "BDI",
      "BDO",
    ]

    attributes = attributes.map((el) => {
      if (typeof el === "string") {
        return { attribute: el, tags: "*", regex: /^.+$/ }
      }
      let output = el
      if (!el.hasOwnProperty("tags")) {
        output.tags = "*"
      }
      if (!el.hasOwnProperty("regex")) {
        output.regex = /^.+$/
      }
      return output
    })
    var el = new DOMParser().parseFromString(html, "text/html")
    var elements = el.querySelectorAll("*")
    for (let i = 0; i < elements.length; i++) {
      const current = elements[i]
      let attr_list = get_attributes(current)
      for (let j = 0; j < attr_list.length; j++) {
        const attribute = attr_list[j]
        if (!attribute_matches(current, attribute)) {
          current.removeAttribute(attr_list[j])
        }
      }
      if (!tags.includes(current.tagName)) {
        current.remove()
      }
    }
    return el.documentElement.innerHTML
    function attribute_matches(element, attribute) {
      let output = attributes.filter((attr) => {
        let returnval =
          attr.attribute === attribute &&
          (attr.tags === "*" || attr.tags.includes(element.tagName)) &&
          attr.regex.test(element.getAttribute(attribute))
        return returnval
      })

      return output.length > 0
    }
    function get_attributes(element) {
      for (
        var i = 0, atts = element.attributes, n = atts.length, arr = [];
        i < n;
        i++
      ) {
        arr.push(atts[i].nodeName)
      }
      return arr
    }
  }

* {
  font-family: sans-serif;
}
textarea {
  width: 49%;
  height: 300px;
  padding: 10px;
  box-sizing: border-box;
  resize: none;
}

<h1>Sanitize HTML client side</h1>
<textarea id='input' placeholder="Unsanitized HTML">
&lt;!-- This removes both the src and onerror attributes because src is not a valid url. --&gt;
&lt;img src=&quot;error&quot; onerror=&quot;alert('XSS')&quot;&gt;
&lt;div id=&quot;something_harmless&quot; onload=&quot;alert('More XSS')&quot;&gt;
   &lt;b&gt;Bold text!&lt;/b&gt; and &lt;em&gt;Italic text!&lt;/em&gt;, some more text. &lt;del&gt;Deleted text!&lt;/del&gt;
&lt;/div&gt;
 &lt;script&gt;
    alert(&quot;This would be XSS&quot;);
  &lt;/script&gt;
</textarea>
<textarea id='output' placeholder="Sanitized HTML will appear here" readonly></textarea>
<script>
  document.querySelector("#input").onkeyup = () => {
    document.querySelector("#output").value = sanitize(document.querySelector("#input").value);
  }
</script>

score 0 · Accepted Answer

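I recommend cutting frameworks out of your life; it will make things much easier for you in the long term.

cloneNode: Cloning a node copies all of its attributes and their values but does NOT copy event listeners.

https://developer.mozilla.org/en/DOM/Node.cloneNode

The following is not tested, though I have been using treewalkers for some time now, and they are one of the most undervalued parts of JavaScript. Here is a list of the node types you can crawl; usually I use SHOW_ELEMENT or SHOW_TEXT.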

http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html#Traversal-NodeFilter

function xhtml_cleaner(id)
{
 var e = document.getElementById(id);
 var f = document.createDocumentFragment();
 f.appendChild(e.cloneNode(true));

 var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);
 var scripts = [];

 while (walker.nextNode())
 {
  var c = walker.currentNode;
  if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
  if (c.hasAttribute('style')) {c.removeAttribute('style');}

  // Removing the current node mid-walk would end the traversal early,
  // so collect script elements here and delete them after the walk.
  if (c.nodeName.toLowerCase()=='script') {scripts.push(c);}
 }

 for (var i = 0; i < scripts.length; i++) {element_del(scripts[i]);}

 alert(new XMLSerializer().serializeToString(f));
 return f;
}


function element_del(element_id)
{
 if (document.getElementById(element_id))
 {
  document.getElementById(element_id).parentNode.removeChild(document.getElementById(element_id));
 }
 else if (element_id)
 {
  element_id.parentNode.removeChild(element_id);
 }
 else
 {
  alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
 }
}
