php - 在 HTML 元素中防止 XSS

Question

以下内容足以防止 HTML 元素内部出现 XSS 吗？

function XSS_encode_html ( $str )
{
    $str = str_replace ( '&', "&amp;", $str );
    $str = str_replace ( '<', "&lt;", $str );
    $str = str_replace ( '>', "&gt;", $str );
    $str = str_replace ( '"', " &quot;", $str );
    $str = str_replace ( '\'', " &#x27;", $str );
    $str = str_replace ( '/', "&#x2F;", $str );

    return $str;
}

如此处所述： -
https://www.owasp.org/index.php/Abridged_XSS_Prevention_Cheat_Sheet#RULE_.231_-_HTML_Escape_Before_Inserting_Untrusted_Data_into_HTML_Element_Content

编辑

我没有使用 htmlspecialchars() 因为： -

它不会改变 / 到/
' （单引号）在设置 ENT_QUOTES 时变为 ' '' （或）。'

根据 OWASP，'（单引号）应该变成'（叫我迂腐）并且
'不推荐，因为它不在 HTML 规范中

score 5 · Accepted Answer

在元素的内容中，唯一可能有害的字符是开始标记分隔符<，因为它可能表示某些标记声明的开始，无论是开始标记、结束标记还是注释。所以这个角色应该总是被转义。

其他字符不一定需要在元素的内容中进行转义。

引号确实只需要在标签内进行转义，特别是当用于包裹在相同引号内或根本不被引用的属性值时。类似地，标记声明关闭分隔符>只需要在标签内进行转义，此处仅在未引用的属性值中使用时。但是，建议同时转义普通的 & 符号，以避免它们被错误地解释为字符引用的开始。

至于替换的原因/，可能是由于SGML中的一个特性，标记语言HTML改编自，它允许所谓的空结束标记：

要了解 null 结束标签在实践中的工作方式，请考虑将其与可以定义为的元素结合使用：
<!ELEMENT ISBN  - -  CDATA --ISBN number-- >
而不是输入 ISBN 号：
<ISBN>0 201 17535 5</ISBN>
我们可以使用 null end-tag 选项以缩短的形式输入元素：
<ISBN/0 201 17535 5/

但是，我从未见过任何浏览器实现过此功能。HTML 的语法规则一直比 SGML 语法规则更严格。

另一个更可能的原因是所谓的原始文本元素 (script和style)的内容模型，它是具有以下限制的纯文本：

原始文本和 RCDATA 元素中的文本不得包含任何出现的字符串“ </”（U+003C LESS-THAN SIGN, U+002F SOLIDUS），后跟不区分大小写匹配元素标签名称的字符，后跟以下之一“制表符”（U+0009）、“LF”（U+000A）、“FF”（U+000C）、“CR”（U+000D）、U+0020 空格、“ >”（U+003E）或" /" (U+002F)。

这里它说在原始文本元素内部，例如script出现的</script/将表示结束标记：

<script>
alert(0</script/.exec("script").index)
</script>

尽管 JavaScript 代码完全有效，但结束标记将由</script/. 但除此之外，/它不会造成任何伤害。而且，如果您只允许在 JavaScript 上下文中使用转义 HTML 的任意输入，那么您已经注定要失败了。

顺便说一句，转义这些字符的字符引用类型无关紧要，无论是命名字符引用（即实体引用）还是数字字符引用，无论是十进制还是十六进制。它们都引用了相同的字符。

score 2 · Accepted Answer

你应该使用htmlspecialchars：

$str = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');

这是文档，基本上可以完成您的功能，但它已经实现并且更干净。但是，它不会转换斜杠和反斜杠。

如果要使用命名的 HTML 实体转换每个字符，可以使用htmlentities：

$str = htmlentities($str, ENT_QUOTES, 'UTF-8');

它记录在这里。如果您只想防止 XSS 攻击和 JS 注入，我会推荐前者，因为它的开销要低得多。

score 1 · Accepted Answer

好吧，这是一个很长的内容，但我觉得如果我不分享它会造成伤害。所有代码直接取自 Drupal 最新稳定版本的源代码的各个部分，并编译到一个区域（如下所示）。非常有效的防止 XSS 攻击的方法。

示例用法：

$html = file_get_contents('http://example.com');
$output = filter_xss($html);
print $output;

或者：

$html = file_get_contents('http://example.com');
// Allow only <ul></ul>, <li></li>, and <p></p> tags.
$allowed_tags = array('ul', 'li', 'p');
$output = filter_xss($html, $allowed_tags);
print $output;

这是运行上述示例所需的代码：

/**
 * Filters HTML to prevent cross-site-scripting (XSS) vulnerabilities.
 *
 * Based on kses by Ulf Harnhammar, see http://sourceforge.net/projects/kses.
 * For examples of various XSS attacks, see: http://ha.ckers.org/xss.html.
 *
 * This code does four things:
 * - Removes characters and constructs that can trick browsers.
 * - Makes sure all HTML entities are well-formed.
 * - Makes sure all HTML tags and attributes are well-formed.
 * - Makes sure no HTML tags contain URLs with a disallowed protocol (e.g.
 *   javascript:).
 *
 * @param $string
 *   The string with raw HTML in it. It will be stripped of everything that can
 *   cause an XSS attack.
 * @param $allowed_tags
 *   An array of allowed tags.
 *
 * @return
 *   An XSS safe version of $string, or an empty string if $string is not
 *   valid UTF-8.
 *
 * @see validate_utf8()
 * @ingroup sanitization
 */
function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'blockquote', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) {
  // Only operate on valid UTF-8 strings. This is necessary to prevent cross
  // site scripting issues on Internet Explorer 6.
  if (!validate_utf8($string)) {
    return '';
  }
  // Store the text format.
  _filter_xss_split($allowed_tags, TRUE);
  // Remove NULL characters (ignored by some browsers).
  $string = str_replace(chr(0), '', $string);
  // Remove Netscape 4 JS entities.
  $string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string);

  // Defuse all HTML entities.
  $string = str_replace('&', '&amp;', $string);
  // Change back only well-formed entities in our whitelist:
  // Decimal numeric entities.
  $string = preg_replace('/&amp;#([0-9]+;)/', '&#\1', $string);
  // Hexadecimal numeric entities.
  $string = preg_replace('/&amp;#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string);
  // Named entities.
  $string = preg_replace('/&amp;([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string);

  return preg_replace_callback('%
    (
    <(?=[^a-zA-Z!/])  # a lone <
    |                 # or
    <!--.*?-->        # a comment
    |                 # or
    <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string
    |                 # or
    >                 # just a >
    )%x', '_filter_xss_split', $string);
}

/**
 * Processes an HTML tag.
 *
 * @param $m
 *   An array with various meaning depending on the value of $store.
 *   If $store is TRUE then the array contains the allowed tags.
 *   If $store is FALSE then the array has one element, the HTML tag to process.
 * @param $store
 *   Whether to store $m.
 *
 * @return
 *   If the element isn't allowed, an empty string. Otherwise, the cleaned up
 *   version of the HTML element.
 */
function _filter_xss_split($m, $store = FALSE) {
  static $allowed_html;

  if ($store) {
    $allowed_html = array_flip($m);
    return;
  }

  $string = $m[1];

  if (substr($string, 0, 1) != '<') {
    // We matched a lone ">" character.
    return '&gt;';
  }
  elseif (strlen($string) == 1) {
    // We matched a lone "<" character.
    return '&lt;';
  }

  if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9]+)([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {
    // Seriously malformed.
    return '';
  }

  $slash = trim($matches[1]);
  $elem = &$matches[2];
  $attrlist = &$matches[3];
  $comment = &$matches[4];

  if ($comment) {
    $elem = '!--';
  }

  if (!isset($allowed_html[strtolower($elem)])) {
    // Disallowed HTML element.
    return '';
  }

  if ($comment) {
    return $comment;
  }

  if ($slash != '') {
    return "</$elem>";
  }

  // Is there a closing XHTML slash at the end of the attributes?
  $attrlist = preg_replace('%(\s?)/\s*$%', '\1', $attrlist, -1, $count);
  $xhtml_slash = $count ? ' /' : '';

  // Clean up attributes.
  $attr2 = implode(' ', _filter_xss_attributes($attrlist));
  $attr2 = preg_replace('/[<>]/', '', $attr2);
  $attr2 = strlen($attr2) ? ' ' . $attr2 : '';

  return "<$elem$attr2$xhtml_slash>";
}

/**
 * Processes a string of HTML attributes.
 *
 * @return
 *   Cleaned up version of the HTML attributes.
 */
function _filter_xss_attributes($attr) {
  $attrarr = array();
  $mode = 0;
  $attrname = '';

  while (strlen($attr) != 0) {
    // Was the last operation successful?
    $working = 0;

    switch ($mode) {
      case 0:
        // Attribute name, href for instance.
        if (preg_match('/^([-a-zA-Z]+)/', $attr, $match)) {
          $attrname = strtolower($match[1]);
          $skip = ($attrname == 'style' || substr($attrname, 0, 2) == 'on');
          $working = $mode = 1;
          $attr = preg_replace('/^[-a-zA-Z]+/', '', $attr);
        }
        break;

      case 1:
        // Equals sign or valueless ("selected").
        if (preg_match('/^\s*=\s*/', $attr)) {
          $working = 1; $mode = 2;
          $attr = preg_replace('/^\s*=\s*/', '', $attr);
          break;
        }

        if (preg_match('/^\s+/', $attr)) {
          $working = 1; $mode = 0;
          if (!$skip) {
            $attrarr[] = $attrname;
          }
          $attr = preg_replace('/^\s+/', '', $attr);
        }
        break;

      case 2:
        // Attribute value, a URL after href= for instance.
        if (preg_match('/^"([^"]*)"(\s+|$)/', $attr, $match)) {
          $thisval = filter_xss_bad_protocol($match[1]);

          if (!$skip) {
            $attrarr[] = "$attrname=\"$thisval\"";
          }
          $working = 1;
          $mode = 0;
          $attr = preg_replace('/^"[^"]*"(\s+|$)/', '', $attr);
          break;
        }

        if (preg_match("/^'([^']*)'(\s+|$)/", $attr, $match)) {
          $thisval = filter_xss_bad_protocol($match[1]);

          if (!$skip) {
            $attrarr[] = "$attrname='$thisval'";
          }
          $working = 1; $mode = 0;
          $attr = preg_replace("/^'[^']*'(\s+|$)/", '', $attr);
          break;
        }

        if (preg_match("%^([^\s\"']+)(\s+|$)%", $attr, $match)) {
          $thisval = filter_xss_bad_protocol($match[1]);

          if (!$skip) {
            $attrarr[] = "$attrname=\"$thisval\"";
          }
          $working = 1; $mode = 0;
          $attr = preg_replace("%^[^\s\"']+(\s+|$)%", '', $attr);
        }
        break;
    }

    if ($working == 0) {
      // Not well formed; remove and try again.
      $attr = preg_replace('/
        ^
        (
        "[^"]*("|$)     # - a string that starts with a double quote, up until the next double quote or the end of the string
        |               # or
        \'[^\']*(\'|$)| # - a string that starts with a quote, up until the next quote or the end of the string
        |               # or
        \S              # - a non-whitespace character
        )*              # any number of the above three
        \s*             # any number of whitespaces
        /x', '', $attr);
      $mode = 0;
    }
  }

  // The attribute list ends with a valueless attribute like "selected".
  if ($mode == 1 && !$skip) {
    $attrarr[] = $attrname;
  }
  return $attrarr;
}

/**
 * Processes an HTML attribute value and strips dangerous protocols from URLs.
 *
 * @param $string
 *   The string with the attribute value.
 * @param $decode
 *   (deprecated) Whether to decode entities in the $string. Set to FALSE if the
 *   $string is in plain text, TRUE otherwise. Defaults to TRUE.
 *
 * @return
 *   Cleaned up and HTML-escaped version of $string.
 */
function filter_xss_bad_protocol($string, $decode = TRUE) {
  // Get the plain text representation of the attribute value (i.e. its meaning).
  if ($decode) {

    $string = decode_entities($string);
  }
  return check_plain(strip_dangerous_protocols($string));
}

/**
 * Strips dangerous protocols (e.g. 'javascript:') from a URI.
 *
 * @param $uri
 *   A plain-text URI that might contain dangerous protocols.
 *
 * @return
 *   A plain-text URI stripped of dangerous protocols. As with all plain-text
 *   strings, this return value must not be output to an HTML page without
 *   check_plain() being called on it. However, it can be passed to functions
 *   expecting plain-text strings.
 *
 */
function strip_dangerous_protocols($uri) {
  static $allowed_protocols;

  if (!isset($allowed_protocols)) {
    $allowed_protocols = array_flip(array('ftp', 'http', 'https', 'irc', 'mailto', 'news', 'nntp', 'rtsp', 'sftp', 'ssh', 'tel', 'telnet', 'webcal'));
  }

  // Iteratively remove any invalid protocol found.
  do {
    $before = $uri;
    $colonpos = strpos($uri, ':');
    if ($colonpos > 0) {
      // We found a colon, possibly a protocol. Verify.
      $protocol = substr($uri, 0, $colonpos);
      // If a colon is preceded by a slash, question mark or hash, it cannot
      // possibly be part of the URL scheme. This must be a relative URL, which
      // inherits the (safe) protocol of the base document.
      if (preg_match('![/?#]!', $protocol)) {
        break;
      }
      // Check if this is a disallowed protocol. Per RFC2616, section 3.2.3
      // (URI Comparison) scheme comparison must be case-insensitive.
      if (!isset($allowed_protocols[strtolower($protocol)])) {
        $uri = substr($uri, $colonpos + 1);
      }
    }
  } while ($before != $uri);

  return $uri;
}

/**
 * Encodes special characters in a plain-text string for display as HTML.
 *
 * Also validates strings as UTF-8 to prevent cross site scripting attacks on
 * Internet Explorer 6.
 *
 * @param $text
 *   The text to be checked or processed.
 *
 * @return
 *   An HTML safe version of $text, or an empty string if $text is not
 *   valid UTF-8.
 *
 * @see validate_utf8()
 * @ingroup sanitization
 */
function check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

/**
 * Decodes all HTML entities (including numerical ones) to regular UTF-8 bytes.
 *
 * Double-escaped entities will only be decoded once ("&amp;lt;" becomes "&lt;"
 * , not "<"). Be careful when using this function, as decode_entities can
 * revert previous sanitization efforts (&lt;script&gt; will become <script>).
 *
 * @param $text
 *   The text to decode entities in.
 *
 * @return
 *   The input $text, with all HTML entities decoded once.
 */
function decode_entities($text) {
  return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

/**
 * Checks whether a string is valid UTF-8.
 *
 * All functions designed to filter input should use validate_utf8
 * to ensure they operate on valid UTF-8 strings to prevent bypass of the
 * filter.
 *
 * When text containing an invalid UTF-8 lead byte (0xC0 - 0xFF) is presented
 * as UTF-8 to Internet Explorer 6, the program may misinterpret subsequent
 * bytes. When these subsequent bytes are HTML control characters such as
 * quotes or angle brackets, parts of the text that were deemed safe by filters
 * end up in locations that are potentially unsafe; An onerror attribute that
 * is outside of a tag, and thus deemed safe by a filter, can be interpreted
 * by the browser as if it were inside the tag.
 *
 * The function does not return FALSE for strings containing character codes
 * above U+10FFFF, even though these are prohibited by RFC 3629.
 *
 * @param $text
 *   The text to check.
 *
 * @return
 *   TRUE if the text is valid UTF-8, FALSE if not.
 */
function validate_utf8($text) {
  if (strlen($text) == 0) {
    return TRUE;
  }
  // With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings
  // containing invalid UTF-8 byte sequences. It does not reject character
  // codes above U+10FFFF (represented by 4 or more octets), though.
  return (preg_match('/^./us', $text) == 1);
}

score 0 · Accepted Answer

对于 perl 脚本或 cgi，您可以使用HTML::Entities：

use HTML::Entities;

$str = encode_entities($str, '<>&"');

score -1 · Accepted Answer

-1

您可以使用 stripslashes() 函数。

    $str = stripslashes($str);

于 2013-04-20T22:31:40.547 回答

php - 在 HTML 元素中防止 XSS

5 回答 5

Related

Reference