php - Extract all content (including HTML) from a div class using PHP

Question

Sample HTML...

<html>
<head></head>
<body>
<table>
<tr>
    <td class="rsheader"><b>Header Content</b></td>
</tr>
<tr>
    <td class="rstext">Some text (Most likely will contain lots of HTML</td>
</tr>
</table>
</body>
</html>

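I need to convert an HTML page into a templated version of that page. The page is made up of several boxes, each with a header (called "rsheader" in the code above) and some text (called "rstext" in the code above).

I am trying to write a PHP script that retrieves the HTML page, probably with file_get_contents, and then extracts whatever is inside the rsheader and rstext divs. Basically, I don't know how! I have tried using DOM, but I don't know it very well, and although I did manage to extract the text, it ignored any HTML.

My PHP...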

<?php

$html = '<html>
<head></head>
<body>
<table>
<tr>
    <td class="rsheader"><b>Header Content</b></td>
</tr>
<tr>
    <td class="rstext">Some text (Most likely will contain lots of HTML</td>
</tr>
</table>
</body>
</html>';

$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="rsheader"]')->item(0);
echo $div->textContent;

?>

If I do a print_r($div) I see this...

DOMElement Object
    (
    [tagName] => td
    [schemaTypeInfo] => 
    [nodeName] => td
    [nodeValue] => Header Content
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => td
    [baseURI] => 
    [textContent] => Header Content
    )

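As you can see, there are no HTML tags in the textContent property, which leads me to believe I am going about this the wrong way :(

I'm really hoping someone can give me some help with this...

Thanks in advance

Paul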

score 2 · Accepted Answer

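XPath is probably more than you need for this task. I would try using DOMDocument's getElementById() method instead. Below is an example, adapted from this post.

Note: Updated to use tag and class names instead of element IDs.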

function getChildHtml( $node ) 
{
    $innerHtml = '';
    $children = $node->childNodes;

    foreach( $children as $child )
    {
        // Serialize each child node (tags included) and append it to the result.
        $innerHtml .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHtml;
}

$dom = new DomDocument();
$dom->loadHtml( $html );

// Gather all table cells in the document.
$cells = $dom->getElementsByTagName( 'td' );

// Loop through the collected table cells looking for those of class 'rsheader' or 'rstext'.
foreach( $cells as $cell )
{
    if( $cell->getAttribute( 'class' ) == 'rsheader' )
    {
        $headerHtml = getChildHtml( $cell );
        // Do something with header html.
    }

    if( $cell->getAttribute( 'class' ) == 'rstext' )
    {
        $textHtml = getChildHtml( $cell );
        // Do something with text html.
    }
}
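
If you would rather keep the XPath query from the question, the same idea can be sketched as follows. This is only a minimal sketch, not the method from the linked post: it assumes PHP 7.1 or later (for the nullable type declarations) and relies on DOMDocument::saveHTML() accepting a node argument, which it does from PHP 5.3.6 on. The helper names innerHtml() and firstByClass() are hypothetical, and the second cell's markup is invented here to show nested tags and a multi-class attribute.

```php
<?php
// Hypothetical helper: concatenates the serialized HTML of every child node.
function innerHtml(?DOMNode $node): string
{
    if ($node === null) {
        return ''; // Nothing matched, so there is nothing to serialize.
    }
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just this node, using HTML rather than XML rules.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical helper: first element whose class attribute contains $class
// as a whole word, so class="rstext box" also matches "rstext".
function firstByClass(DOMXPath $xpath, string $class): ?DOMNode
{
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    return $xpath->query($query)->item(0);
}

// Sample markup, assumed for illustration only.
$html = '<html><body><table>
<tr><td class="rsheader"><b>Header Content</b></td></tr>
<tr><td class="rstext box">Some <i>text</i> with markup</td></tr>
</table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // Keep parser warnings about sloppy HTML quiet.
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$headerHtml = innerHtml(firstByClass($xpath, 'rsheader'));
$textHtml   = innerHtml(firstByClass($xpath, 'rstext'));

echo $headerHtml, "\n";
echo $textHtml, "\n";
```

Using saveHTML() instead of saveXML() keeps void elements such as br in HTML form, which is probably what you want if the extracted markup goes into an HTML template.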

score 0 · Accepted Answer

Take a look at this answer and use it as a guide: Retrieving specific data from a website

If you need more detailed help, I'm here to help.
