php - Remove parent element, keep all inner children in DOMDocument with saveHTML

Question

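I'm using XPath to manipulate a short HTML snippet. When I output the changed snippet with $doc->saveHTML(), a DOCTYPE is added and HTML / BODY tags wrap the output. I want to remove those wrappers but keep all of the children inside, using only DOMDocument functions. For example: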

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
echo htmlentities( $doc->saveHTML() );

This produces:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><body>
<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>
</body></html>

I've tried some simple tricks, such as:

# removes doctype
$doc->removeChild($doc->firstChild);

# <body> replaces <html>
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);

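So far this only removes the DOCTYPE and replaces HTML with BODY. What remains at this point is body > a variable number of elements.

How do I remove the <body> tag but keep all of its children, given that their structure is variable, in a neat and clean way using PHP's DOM manipulation?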

score 16 · Accepted Answer

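Update

Here's a version that doesn't extend DOMDocument, though I think extending is the right approach, since you're trying to achieve functionality that isn't built into the DOM API.

Note: I'm interpreting "clean" and "no workarounds" as keeping all manipulation within the DOM API. As soon as you resort to string manipulation, you're in workaround territory.

As in the original answer, what I'm doing is leveraging DOMDocumentFragment to manipulate multiple nodes that all sit at the root level. No string manipulation takes place, so to me this doesn't count as a workaround.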

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');

// Remove doctype node
$doc->doctype->parentNode->removeChild($doc->doctype);

// Remove html element, preserving child nodes
$html = $doc->getElementsByTagName("html")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
    $fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);

// Remove body element, preserving child nodes
$body = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($body->childNodes->length > 0) {
    $fragment->appendChild($body->childNodes->item(0));
}
$body->parentNode->replaceChild($fragment, $body);

// Output results
echo htmlentities($doc->saveHTML());
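
The two loops above follow the same pattern, so they could be folded into a single function. The following is only a minimal sketch under that assumption: unwrapNode is a hypothetical helper name (it is not part of the DOM API), and the sample markup is made up for illustration. It unwraps a div inside the body rather than the html/body elements themselves.

```php
<?php
// Hypothetical helper (not part of the DOM API): replace $node with its own children.
function unwrapNode(DOMNode $node): void
{
    // Nothing to preserve: just drop the node, which avoids inserting an empty fragment.
    if (!$node->hasChildNodes()) {
        $node->parentNode->removeChild($node);
        return;
    }

    $fragment = $node->ownerDocument->createDocumentFragment();

    // appendChild() moves each child out of $node, so firstChild advances by itself.
    while ($node->firstChild !== null) {
        $fragment->appendChild($node->firstChild);
    }

    // Inserting a fragment inserts its children in place of $node.
    $node->parentNode->replaceChild($fragment, $node);
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>one</p><p>two</p></div>');

// Unwrap the <div>, leaving both <p> elements directly under <body>.
unwrapNode($doc->getElementsByTagName('div')->item(0));

$body = $doc->getElementsByTagName('body')->item(0);
$result = $doc->saveHTML($body);
echo htmlentities($result);
```

The early return for childless nodes is a design choice in this sketch, meant to sidestep any version-specific handling of empty fragments.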

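Original answer

This solution is rather lengthy, but that's because it works by extending the DOM so that your final code is as short as possible.

sliceOutNode is where the magic happens. Let me know if you have any questions: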

<?php

class DOMDocumentExtended extends DOMDocument
{
    public function __construct( $version = "1.0", $encoding = "UTF-8" )
    {
        parent::__construct( $version, $encoding );

        $this->registerNodeClass( "DOMElement", "DOMElementExtended" );
    }

    // This method will need to be removed once PHP supports LIBXML_NOXMLDECL
    public function saveXML( DOMNode $node = NULL, $options = 0 )
    {
        $xml = parent::saveXML( $node, $options );

        if( $options & LIBXML_NOXMLDECL )
        {
            $xml = $this->stripXMLDeclaration( $xml );
        }

        return $xml;
    }

    public function stripXMLDeclaration( $xml )
    {
        return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
    }
}

class DOMElementExtended extends DOMElement
{
    public function sliceOutNode()
    {
        $nodeList = new DOMNodeListExtended( $this->childNodes );
        $this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
    }

    public function replaceNodeWithNode( DOMNode $node )
    {
        return $this->parentNode->replaceChild( $node, $this );
    }
}

class DOMNodeListExtended extends ArrayObject
{
    public function __construct( $mixedNodeList )
    {
        parent::__construct( array() );

        $this->setNodeList( $mixedNodeList );
    }

    private function setNodeList( $mixedNodeList )
    {
        if( $mixedNodeList instanceof DOMNodeList )
        {
            $this->exchangeArray( array() );

            foreach( $mixedNodeList as $node )
            {
                $this->append( $node );
            }
        }
        elseif( is_array( $mixedNodeList ) )
        {
            $this->exchangeArray( $mixedNodeList );
        }
        else
        {
            throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
        }
    }

    public function toFragment( DOMDocument $contextDocument )
    {
        $fragment = $contextDocument->createDocumentFragment();

        foreach( $this as $node )
        {
            $fragment->appendChild( $contextDocument->importNode( $node, true ) );
        }

        return $fragment;
    }

    // Built-in methods of the original DOMNodeList

    public function item( $index )
    {
        return $this->offsetGet( $index );
    }

    public function __get( $name )
    {
        switch( $name )
        {
            case "length":
                return $this->count();
            break;
        }

        return false;
    }
}

// Load HTML/XML using our fancy DOMDocumentExtended class
$doc = new DOMDocumentExtended();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');

// Remove doctype node
$doc->doctype->parentNode->removeChild( $doc->doctype );

// Slice out html node
$html = $doc->getElementsByTagName("html")->item(0);
$html->sliceOutNode();

// Slice out body node
$body = $doc->getElementsByTagName("body")->item(0);
$body->sliceOutNode();

// Pick your poison: XML or HTML output
echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
echo htmlentities( $doc->saveHTML() );

score 11 · Accepted Answer

saveHTML can output a subset of the document, which means we can traverse the body and ask it to output each child node one by one.

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://google.com"><img src="http://google.com/img.jpeg" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here

// Let's traverse the body and output every child node
$bodyNode = $doc->getElementsByTagName('body')->item(0);
foreach ($bodyNode->childNodes as $childNode) {
  echo $doc->saveHTML($childNode);
}

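This may not be the most elegant solution, but it works. Alternatively, we could wrap all child nodes in some container element (e.g. a div) and output only that container (but the container's tags would then be included in the output).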
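
If a string is needed rather than direct output, the same loop can be wrapped in a function. This is a rough sketch only: innerHtml is a hypothetical helper name, not a DOMDocument method, and it relies on saveHTML() accepting a node argument.

```php
<?php
// Hypothetical helper: serialize only the children of $node, not $node itself.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title</strong></p><p>Body text</p>');

$body = $doc->getElementsByTagName('body')->item(0);
$result = innerHtml($body);
echo htmlentities($result);
```

Because the helper takes any DOMNode, it could equally be pointed at a container element selected with XPath, which gives the "container without its own tags" result.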

score 2 · Accepted Answer

Here's how I did it:

-- A quick helper function that gives you the HTML content of a specific DOM element

function nodeContent($n, $outer=false) {
   $d = new DOMDocument('1.0');
   $b = $d->importNode($n->cloneNode(true),true);
   $d->appendChild($b); $h = $d->saveHTML();
   // remove the outer tags
   if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
   return $h;
}

-- Find the body node in your document and get its content

$query = $xpath->query("//body")->item(0);
if ($query)
{
    echo nodeContent($query);
}

Update 1:

Some extra info: as of PHP 5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter, similar to DOMDocument->saveXML(). You can do

$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
echo $doc->saveHTML($query);

For everyone else, the helper function will help.

score 0 · Accepted Answer

tl;dr

Requirements: PHP 5.4.0 and Libxml 2.6.0

$doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

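Explanation

http://php.net/manual/en/domdocument.loadhtml.php "Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."

LIBXML_HTML_NOIMPLIED sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

LIBXML_HTML_NODEFDTD sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when one is not found.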
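
One caveat that seems worth checking on your own libxml version: with LIBXML_HTML_NOIMPLIED there is no implied body to hold sibling elements, so a snippet with several top-level elements (like the one in the question) may not come back with the structure you expect. A commonly used workaround is to wrap the snippet in a temporary container and serialize only the container's children. The sketch below assumes that approach; the div with id "wrapper" is a hypothetical container, not part of the original input.

```php
<?php
$snippet = '<p>first</p><p>second</p>';

$doc = new DOMDocument();
// The wrapper <div> is an assumption of this sketch; it gives the parser a single root element.
$doc->loadHTML(
    '<div id="wrapper">' . $snippet . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// With no implied html/body, the wrapper is the document element.
$wrapper = $doc->documentElement;

// Serialize the wrapper's children only, so the wrapper never reaches the output.
$output = '';
foreach ($wrapper->childNodes as $child) {
    $output .= $doc->saveHTML($child);
}
echo htmlentities($output);
```

Any manipulation (XPath queries and so on) would go between loading and the serialization loop.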

score -1 · Accepted Answer

There are two ways to do this:

$content = substr($content, strpos($content, '<html><body>') + 12); // Remove Everything Before & Including The Opening HTML & Body Tags.
$content = substr($content, 0, -14); // Remove Everything After & Including The Closing HTML & Body Tags.

Or, even better:

$dom->normalizeDocument();
$content = $dom->saveHTML();
