php - Remove script tags from HTML content

Question

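I'm using HTML Purifier (http://htmlpurifier.org/)

I only want to remove <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: are there other ways to remove script tags from HTML?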

score 152 · Accepted Answer

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

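Regular expressions, however, are not meant for parsing HTML/XML. Even if you write the perfect expression it will break eventually, so it is not worth it. A regex can still be handy for a quick fix on some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.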
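To make the warning concrete, here is a minimal sketch of one way the pattern above can be defeated. The input string is a made-up example, not taken from the question; it relies on preg_replace making a single pass over the input.

```php
<?php
// Hypothetical hostile input: a script tag nested inside a split-up one.
$html = '<scr<script></script>ipt>alert(1)</script>';

// The same pattern as in the answer above.
$clean = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

// Removing the inner "<script></script>" joins "<scr" and "ipt>" together,
// so the output contains a complete script tag again.
echo $clean; // <script>alert(1)</script>
```

Running the replacement in a loop until nothing changes would catch this particular input, but it does not make the regex approach safe in general.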

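A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that shows how easy, clean (compared to regex), (almost) reliable and (almost) safe it is to do the same thing: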

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I left the HTML out deliberately, because even this can break.

score 39 · Accepted Answer

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

// for each tag, remove it from the DOM; the node list is live, so walk it
// backwards to keep the remaining indexes valid
for ($i = $length - 1; $i >= 0; $i--) {
  $node = $script_tags->item($i);
  $node->parentNode->removeChild($node);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

I tested this with the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Keep in mind that the DOMDocument parser requires PHP 5 or later.
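A detail that matters for loops like the one in this answer: the list returned by getElementsByTagName() is live, so it shrinks as nodes are removed from the document. The sketch below only illustrates that behaviour; the sample markup is made up.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML('<p>text</p><script>1</script><script>2</script><script>3</script>');

$scripts = $doc->getElementsByTagName('script');
$before = $scripts->length;

// Remove only the first script element.
$first = $scripts->item(0);
$first->parentNode->removeChild($first);

// The same list object now reports one element fewer.
$after = $scripts->length;

echo "$before -> $after\n"; // 3 -> 2
```

This is why a removal loop should either walk the list backwards, keep removing item(0), or copy the nodes into an array first.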

score 5 · Accepted Answer

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
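    // copy the live node list into an array first; removing nodes while
    // iterating over the list itself skips elements
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }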
}
$html = $dom->saveHTML();

score 4 · Accepted Answer

A simple approach using string manipulation.

function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);
        
        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}
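The answer does not show a call, so here is a usage sketch. It assumes the opening marker is passed without its closing > so that tags with attributes also match; the sample strings are made up.

```php
<?php
// Copy of the function above (needs the mbstring extension).
function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);

        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}

// A closed script tag is cut out and the text around it is kept.
$closed = stripStr('<p>a</p><script type="text/javascript">x</script><b>y</b>', '<script', '</script>');
echo $closed, "\n"; // <p>a</p><b>y</b>

// Without a closing marker, everything after the opening marker is dropped.
$unclosed = stripStr('<p>a</p><script>x', '<script', '</script>');
echo $unclosed, "\n"; // <p>a</p>
```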

score 3 · Accepted Answer

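Try this complete and flexible solution. It works well and is partly based on some of the previous answers, but it adds extra validation checks and gets rid of the extra implied HTML that loadHTML(...) adds. It is split into two separate functions (one depends on the other, so do not reorder them), so you can use it to remove several kinds of HTML tags at once (i.e. not just 'script' tags). For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally a single tag name as a string. So, without further ado, here is the code: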


/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

if (!function_exists('removeAllInstancesOfTag'))
    {
        function removeAllInstancesOfTag($html, $tag_nm)
            {
                if (!empty($html))
                    {
                        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
                        $doc = new DOMDocument();
                        $doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);

                        if (!empty($tag_nm))
                            {
                                if (is_array($tag_nm))
                                    {
                                        $tag_nms = $tag_nm;
                                        unset($tag_nm);

                                        foreach ($tag_nms as $tag_nm)
                                            {
                                                $rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
                                                $rmvbl_itms_arr = [];

                                                foreach ($rmvbl_itms as $itm)
                                                    {
                                                        $rmvbl_itms_arr[] = $itm;
                                                    };

                                                foreach ($rmvbl_itms_arr as $itm)
                                                    {
                                                        $itm->parentNode->removeChild($itm);
                                                    };
                                            };
                                    }
                                else if (is_string($tag_nm))
                                    {
                                        $rmvbl_itms = $doc->getElementsByTagName($tag_nm);
                                        $rmvbl_itms_arr = [];

                                        foreach ($rmvbl_itms as $itm)
                                            {
                                                $rmvbl_itms_arr[] = $itm;
                                            };

                                        foreach ($rmvbl_itms_arr as $itm)
                                            {
                                                $itm->parentNode->removeChild($itm); 
                                            };
                                    };
                            };

                        return $doc->saveHTML();
                    }
                else
                    {
                        return '';
                    };
            };
    };

/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

/* Prerequisites: 'removeAllInstancesOfTag(...)' */

if (!function_exists('removeAllScriptTags'))
    {
        function removeAllScriptTags($html)
            {
                return removeAllInstancesOfTag($html, 'script');
            };
    };

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */

Here is a test usage example:


$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!

score 2 · Accepted Answer

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

Things can go wrong when running a regex, so it is safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

This way, when an "accident" happens, we get the original $html back instead of an empty string.
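One edge case in the safer variant above: the short ternary falls back on any falsy value, not only on the null that preg_replace returns on error. A small sketch, using a made-up input that consists of nothing but a script tag:

```php
<?php
$html = '<script>alert(1)</script>';

// The whole input matches, so the replacement result is an empty string.
$replaced = preg_replace("/<script.*?\/script>/s", "", $html);

// '' is falsy, so the short ternary hands back the original markup.
$withShortTernary = $replaced ?: $html;

// An explicit null check only falls back when preg_replace failed.
$withNullCheck = ($replaced !== null) ? $replaced : $html;

var_dump($withShortTernary); // string(25) "<script>alert(1)</script>"
var_dump($withNullCheck);    // string(0) ""
```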

score 2 · Accepted Answer

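This is a merge of the answers by ClandestineCoder and Binh WPO.

The problem with the angle brackets of a script tag is that they can appear in more than one variant,

e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;)

So instead of building an array of patterns with countless variants, IMHO a better solution is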

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This removes anything that looks like script.../script, regardless of the bracket encoding or variant. You can test it here: https://regex101.com/r/lK6vS8/1

score 2 · Accepted Answer

An example that modifies ctf0's answer. It runs preg_replace only once, checks for errors, and also handles the character codes for the forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;

If you are using PHP 7, you can use the null coalescing operator to simplify it further.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str);
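To see when the null fallback actually triggers, here is a minimal sketch. With the u modifier, preg_replace returns null if the subject is not valid UTF-8; the input below is a made-up example with a stray byte appended.

```php
<?php
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';

// "\xFF" can never occur in valid UTF-8, so the u modifier rejects the subject.
$str = "<script>alert(1)</script>\xFF";

$replace = preg_replace($pattern, '', $str);

$isNull = ($replace === null);
$isUtf8Error = (preg_last_error() === PREG_BAD_UTF8_ERROR);

// The fallback returns the input unchanged, script tag included.
$result = $replace ?? $str;

var_dump($isNull, $isUtf8Error); // bool(true) bool(true)
```

Because the fallback hands back the unfiltered input, it may be worth treating a null result as an error rather than as a value to pass on.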

score 2 · Accepted Answer

function remove_script_tags($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach($script as $item){
        $remove[] = $item;
    }

    foreach ($remove as $item){
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();
    $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
    $html = str_replace('</p></body></html>', '', $html);
    return $html;
}

Dejan's answer is good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP
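As an alternative to stripping the wrapper with string functions, the children of the generated body element can be serialized one by one, since saveHTML() accepts a node argument. This is only a sketch: the function name is hypothetical and it assumes the input is an HTML fragment rather than a full document.

```php
<?php
// Hypothetical helper; assumes $html is a fragment, not a whole page.
function removeScriptTagsFromFragment(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Copy the live list before removing nodes from it.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $item) {
        $item->parentNode->removeChild($item);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing but head content was parsed
    }

    // Serialize only what sits inside <body>, skipping doctype/html/body.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$fragment = removeScriptTagsFromFragment('<p>Hello <b>world</b></p><script>alert(1)</script>');
echo $fragment;
```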

score 1 · Accepted Answer

I would use BeautifulSoup if it is available. It makes this sort of thing very easy.

Don't try to do it with regular expressions. That way lies madness.

score 1 · Accepted Answer

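I had been struggling with this question. I found that you really only need one function: explode('>', $html); The only common denominator of any tag is < and >. After that it is usually quotation marks ( " ). Once you find the common denominator, you can extract information easily. This is what I came up with: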

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);

$counter = null;//index of the chunk that holds the opening tag, if any

foreach($h as $k => $v){

    $v = trim($v);//clean it up a bit

    if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable

        $counter = $k;//match opening tag and start counter for backtrace

    }elseif($counter !== null && preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done

        for($i = $counter; $i <= $k; $i++){
            $h[$i] = null;//backtrace and clear everything in between
        }

        $counter = null;
    }
}

$ht = [];
foreach($h as $v){
    if($v !== null){
        $ht[] = $v;//drop the cleared chunks so when we implode it works right.
    }
}
$html = implode('>', $ht);//all scripts stripped.


echo $html;

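I think this really only works for script tags, because you will never have nested script tags. Of course, you can easily add more code that performs the same check and collects nested tags.

I call it accordion coding. implode(); explode(); are the easiest way to get your logic flowing if you have a common denominator.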

score 1 · Accepted Answer

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

It can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

score 1 · Accepted Answer

Use the str_replace function to replace them with an empty string or something else

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query; 
//this echoes console.log("I should be banned")

?>
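Since str_replace matches the search strings exactly and case-sensitively, it helps to see what it leaves behind. A short sketch with two made-up inputs:

```php
<?php
$badChar = array('<script>', '</script>');

// An opening tag with an attribute is not equal to '<script>', so it stays.
$withAttribute = str_replace($badChar, '', '<script type="text/javascript">alert(1)</script>');

// Upper-case tags do not match either, so nothing is removed.
$upperCase = str_replace($badChar, '', '<SCRIPT>alert(1)</SCRIPT>');

echo $withAttribute, "\n"; // <script type="text/javascript">alert(1)
echo $upperCase, "\n";     // <SCRIPT>alert(1)</SCRIPT>
```

str_ireplace would cover the upper-case variant, but tags with attributes still slip through, and in every case the script body itself remains in the output.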
