
I have prepared a whitelist of allowed styles, and I want to remove every style that is not on the whitelist from an HTML string.

$allowed_styles = array('font-size','color','font-family','text-align','margin-left');
$html = 'xyz html';
$html_string = '<body>' . $html . '</body>';
$dom = new DOMDocument();
$dom->loadHTML($html_string);
$elements = $dom->getElementsByTagName('body');
foreach($elements as $element) {
    foreach($element->childNodes as $child) {
        if($child->hasAttribute('style')) {
            $style = strtolower(trim($child->getAttribute('style')));

            // match and get only the CSS property names
            preg_match_all('/(?<names>[a-z\-]+):/', $style, $matches);

            for($i=0;$i<sizeof($matches["names"]);$i++) {
                $style_property = $matches["names"][$i];

                // if the CSS property is not in the allowed styles array,
                // then remove the whole style attribute from this child
                if(!in_array($style_property,$allowed_styles)) {
                    $child->removeAttribute('style');
                    continue;
                }
            }
        }
    }
}

$dom->saveHTML();
$html_output = $dom->getElementsByTagName('body');

I have tested this with many HTML strings and it works fine everywhere. But when I try to filter this HTML string:

$html_string = '<div style="font-style: italic; text-align: center; 
background-color: red;">On The Contrary</div><span 
style="font-style: italic; background-color: rgb(244, 249, 255); 
font-size: 32px;"><b style="text-align: center; 
background-color: rgb(255, 255, 255);">This is USA</b></span>';

every disallowed style is removed from the string except on this element:

<b style="text-align: center; background-color: rgb(255, 255, 255);">

Can anyone tell me an effective and robust way to remove all styles other than the whitelisted ones?


2 Answers


For this (and any other nested) HTML you have to use a recursive function like this:

$html = 'your html';
$allowed_styles = array('font-size','color','font-family','text-align','margin-left');
$html_string = '<body>' . $html . '</body>';
$dom = new DOMDocument();
$dom->loadHTML($html_string);
$elements = $dom->getElementsByTagName('body');
foreach ($elements as $element)
    clearHtml($element, $allowed_styles);
$html_output = $dom->saveHTML();

// Recursively visit $tree and its descendants; drop the whole style
// attribute from any element that uses a property outside the whitelist.
function clearHtml($tree, $allowed_styles) {
    // Only element nodes have attributes; text and comment nodes are skipped.
    if ($tree->nodeType == XML_ELEMENT_NODE) {
        if ($tree->hasAttribute('style')) {
            $style = strtolower(trim($tree->getAttribute('style')));
            // Capture each CSS property name (the part before a colon).
            preg_match_all('/(?<names>[a-z\-]+):/', $style, $matches);
            for ($i = 0; $i < count($matches['names']); $i++) {
                $style_property = $matches['names'][$i];
                if (!in_array($style_property, $allowed_styles)) {
                    $tree->removeAttribute('style');
                    break; // the attribute is gone, no need to check further
                }
            }
        }
        if ($tree->childNodes)
            foreach ($tree->childNodes as $child)
                clearHtml($child, $allowed_styles);
    }
}
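
The recursive walk above can also be written without explicit recursion. The following is only a sketch, assuming the same whitelist policy (one disallowed property drops the whole attribute): it asks `DOMXPath` for every element that has a `style` attribute, at any depth. The function name `stripDisallowedStyleAttributes` and the sample HTML are illustrative, not part of the original answer.

```php
<?php
// Sketch: visit every element carrying a style attribute via XPath,
// instead of walking the tree recursively.
// stripDisallowedStyleAttributes() is a hypothetical helper name.
function stripDisallowedStyleAttributes(DOMDocument $dom, array $allowed_styles): void
{
    $xpath = new DOMXPath($dom);
    // '//*[@style]' selects elements at any depth that have a style attribute.
    foreach ($xpath->query('//*[@style]') as $element) {
        $style = strtolower($element->getAttribute('style'));
        // Capture each property name (the part before a colon).
        preg_match_all('/([a-z\-]+)\s*:/', $style, $matches);
        // array_diff() is non-empty when at least one name is not whitelisted.
        if (array_diff($matches[1], $allowed_styles)) {
            $element->removeAttribute('style');
        }
    }
}

$allowed_styles = array('font-size', 'color', 'font-family', 'text-align', 'margin-left');
$dom = new DOMDocument();
$dom->loadHTML('<body><span style="font-size: 32px;"><b style="text-align: center; background-color: red;">This is USA</b></span></body>');
stripDisallowedStyleAttributes($dom, $allowed_styles);
$html_output = $dom->saveHTML();
```

Here the nested `<b>` loses its `style` attribute because of `background-color`, while the `<span>` keeps `font-size: 32px;`.
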
answered 2013-03-29T16:08:56.030

Similar to Oleja's solution, but this one removes only the disallowed properties rather than the whole style attribute.

// Usage: removeStylesheet($doc, ['color', 'font-weight']);

function removeStylesheet($tree, $allowed_styles) {
    // Only elements carry attributes; a DOMDocument or comment node does not.
    if ($tree instanceof DOMElement && $tree->hasAttribute('style')) {
        $style = strtolower(trim($tree->getAttribute('style')));
        // Capture "name: value" pairs; a value stops at a quote or semicolon.
        preg_match_all('/(?<names>[a-z\-]+) *:(?<values>[^\'";]+)/', $style, $matches);
        $new_styles = array();
        for ($i = 0; $i < count($matches['names']); $i++) {
            if (in_array($matches['names'][$i], $allowed_styles)) {
                $new_styles[] = $matches['names'][$i] . ':' . $matches['values'][$i];
            }
        }
        // Keep only the allowed declarations, or drop the attribute if none remain.
        if ($new_styles)
            $tree->setAttribute('style', implode(';', $new_styles));
        else
            $tree->removeAttribute('style');
    }
    // Recurse into the children, if any.
    if ($tree->childNodes) {
        foreach ($tree->childNodes as $child) {
            removeStylesheet($child, $allowed_styles);
        }
    }
}
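
Because the code above lowercases the whole style string and its value pattern stops at the first quote, a declaration such as `font-family: "Times New Roman", serif` would lose its value. One possible alternative, shown here only as a sketch, is to split the string on `;` and then on the first `:` so that values are kept verbatim. The helper name `filterStyleDeclarations` is hypothetical, and the sketch assumes no semicolons occur inside quoted strings or `url(...)` values.

```php
<?php
// Sketch: filter one style string by splitting on ';' and the first ':'.
// filterStyleDeclarations() is a hypothetical helper, not part of the answer above.
// Assumption: no ';' appears inside quoted strings or url(...) values.
function filterStyleDeclarations(string $style, array $allowed_styles): string
{
    $kept = array();
    foreach (explode(';', $style) as $declaration) {
        // Split into property name and value at the first colon only.
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // empty or malformed declaration
        }
        $name = strtolower(trim($parts[0])); // only the name is lowercased
        $value = trim($parts[1]);            // the value keeps its case and quotes
        if ($value !== '' && in_array($name, $allowed_styles, true)) {
            $kept[] = $name . ': ' . $value;
        }
    }
    return implode('; ', $kept);
}

$allowed_styles = array('font-size', 'color', 'font-family', 'text-align', 'margin-left');
$filtered = filterStyleDeclarations(
    'font-style: italic; font-family: "Times New Roman", serif; font-size: 32px;',
    $allowed_styles
);
// $filtered is: font-family: "Times New Roman", serif; font-size: 32px
```

An empty return value would mean that no allowed declaration survived, in which case the caller could remove the `style` attribute instead of setting it.
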
answered 2019-10-29T15:18:27.590