php - Struggling to extract content from a string (PHP)

Question

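I'm struggling to extract content from a string (stored in a database). Each div is a chapter, and the h2 content is its title. I want to extract the title and the content (the div) of each chapter separately.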

<p>
<div>
   <h2>Title 1</h2>
   Chapter Content 1 with standard html tags (ex: the following tags)
   <strong>aaaaaaaa</strong><br />
   <em>aaaaaaaaa</em><br />
   <u>aaaaaaaa</u><br />
   <span style="color:#00ffff"></span><br />
</div>
<div>
   <h2>Title 2</h2>
   Chapter Content 2
</div>
...
</p>

I tried preg_match_all in PHP, but it doesn't work when the content contains standard HTML tags.

function splitDescription($pDescr)
{
    $regex = "#<div.*?><h2.*?>(.*?)</h2>(.*?)</div>#";
    preg_match_all($regex, $pDescr, $result);

    return $result;
}

score 1 · Accepted Answer

Before you try to parse HTML with regular expressions, I suggest you read this post.

There are many excellent XML/HTML parsers you can use instead.

score 1 · Accepted Answer

Don't use regular expressions for this; they are not the right tool for the job. Use an HTML parser, such as PHP's DOMDocument:

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXPath( $doc);

// For each <div> chapter
foreach( $xpath->query( '//div') as $chapter) {

    // Get the <h2> and save its inner value into $title
    $title_node = $xpath->query( 'h2', $chapter)->item( 0);
    if( $title_node === null) {
        continue; // Skip any <div> that has no direct <h2> child
    }
    $title = $title_node->textContent;

    // Remove the <h2>
    $chapter->removeChild( $title_node);

    // Save the rest of the <div> children in $content
    $content = '';
    foreach( $chapter->childNodes as $child) {
        $content .= $doc->saveHTML( $child);
    }
    echo "$title - " . htmlentities( $content) . "\n";
}
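
If you would rather get the chapters back as data than echo them, the same approach can be wrapped in a function. This is only a minimal sketch: it reuses the splitDescription name from the question, while the 'title' and 'content' array keys are my own choice, and it assumes every chapter is a <div> with a direct <h2> child, as in the sample markup.

```php
<?php
// Sketch only: returns one ['title' => ..., 'content' => ...] entry per <div>.
// The array keys are illustrative, not part of any library API.
function splitDescription(string $html): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $chapters = [];

    foreach ($xpath->query('//div') as $chapter) {
        // First <h2> that is a direct child of this <div>
        $titleNode = $xpath->query('h2', $chapter)->item(0);
        if ($titleNode === null) {
            continue; // Not a chapter: no title found
        }

        // Serialize every child except the title itself
        $content = '';
        foreach ($chapter->childNodes as $child) {
            if ($child->isSameNode($titleNode)) {
                continue;
            }
            $content .= $doc->saveHTML($child);
        }

        $chapters[] = [
            'title'   => trim($titleNode->textContent),
            'content' => trim($content),
        ];
    }

    return $chapters;
}
```

Calling splitDescription($html) on the sample from the question should give one entry per chapter, with the inline tags (<strong>, <em>, and so on) still present in 'content'. Unlike the loop above, this version skips the <h2> while serializing rather than removing it from the document, so the parsed tree is left unchanged.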

Demo
