Here is the code I ended up with.
<?php
$content_old = <<<'EOM'
<p> </p>
<p>lol<strong>test</strong></p>
<p><strong>This is a header</strong></p>
<p>Content content blah blah blah.</p>
EOM;
$content = preg_replace("/<p[^>]*>[\s| ]*<\/p>/", '', $content_old);
$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);
foreach ($xp->query('//p/strong') as $node) {
    $parent = $node->parentNode;
    if ($parent->textContent == $node->textContent &&
        str_word_count($node->textContent) <= 8) {
        $header = $doc->createElement('h2');
        $parent->parentNode->replaceChild($header, $parent);
        $header->appendChild($doc->createTextNode($node->textContent));
    }
}
// calling saveXML() on the whole document is not good enough, because the
// output then includes the wrapper tags that loadHTML() added around the fragment
$xp = new DOMXPath($doc);
$everything = $xp->query("body/*"); // retrieves all elements inside body tag
$output = '';
if ($everything->length > 0) { // check if it retrieved anything in there
    foreach ($everything as $thing) {
        $output .= $doc->saveXML($thing) . "\n";
    }
}
echo "--- ORIGINAL --\n\n";
echo $content_old;
echo "\n\n--- UPDATED ---\n\n";
echo $output;
When I run the script, this is the output I get:
--- ORIGINAL --
<p> </p>
<p>lol<strong>test</strong></p>
<p><strong>This is a header</strong></p>
<p>Content content blah blah blah.</p>
--- UPDATED ---
<p>lol<strong>test</strong></p>
<h2>This is a header</h2>
<p>Content content blah blah blah.</p>
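As an aside on the serialization step: since PHP 5.3.6, saveHTML() also accepts a node argument, so the children of <body> can be written out as HTML rather than XML. The following is only a sketch; the helper name bodyInnerHtml is made up for illustration and is not part of the script above.

```php
<?php
// Hypothetical helper: serialize only the children of <body>,
// skipping the doctype/html/body wrappers that loadHTML() adds.
function bodyInnerHtml(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes just that node (PHP >= 5.3.6)
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$doc = new DOMDocument;
$doc->loadHTML('<p>one</p><h2>two</h2>');
echo bodyInnerHtml($doc), "\n";
```

Whether saveXML() or saveHTML() is the better fit depends on whether you want XML-style or HTML-style serialization of things like void elements.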
Update #1
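It's worth noting that if there are other tags inside the <p><strong> (for example, <p><strong><a>), then the entire <p> gets replaced, which was not my intention.
This is easily fixed by changing the if to the following: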
if ($parent->textContent == $node->textContent &&
str_word_count($node->textContent) <= 8 &&
$node->childNodes->item(0)->nodeType == XML_TEXT_NODE) {
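Note that the condition above looks only at the first child of the <strong>. If you also want to reject cases like <strong>text <a>link</a></strong>, or an empty <strong>, one possible variant is a small helper that inspects every child. This is a sketch, and hasOnlyTextChildren is a hypothetical name, not something from the original code.

```php
<?php
// Hypothetical helper: true only when $node has at least one child
// and every one of its children is a text node.
function hasOnlyTextChildren(DOMNode $node): bool
{
    if (!$node->hasChildNodes()) {
        return false; // an empty <strong> is not a header
    }
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_TEXT_NODE) {
            return false; // found an element (e.g. <a>) or another non-text node
        }
    }
    return true;
}

$doc = new DOMDocument;
$doc->loadHTML(
    '<p><strong>Plain header</strong></p>' .
    '<p><strong><a href="#">Linked</a></strong></p>'
);
$xp = new DOMXPath($doc);
foreach ($xp->query('//p/strong') as $node) {
    echo $node->textContent, ': ', var_export(hasOnlyTextChildren($node), true), "\n";
}
```

In the if, the third condition would then become hasOnlyTextChildren($node).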
Update #2
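It's also worth noting that if the content inside the <p><strong> contains characters that need to be escaped in HTML (for example, &), the original createElement call causes problems, because its second argument is not escaped for you.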
The old code was:
$header = $doc->createElement('h2', $node->textContent);
$parent->parentNode->replaceChild($header, $parent);
The new code (which works correctly) is:
$header = $doc->createElement('h2');
$parent->parentNode->replaceChild($header, $parent);
$header->appendChild($doc->createTextNode( $node->textContent ));
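To see the difference in isolation, here is a minimal sketch using a made-up string, 'Tom & Jerry'. The @ operator is there only to silence the "unterminated entity reference" warning that the old form raises in this demo.

```php
<?php
$doc = new DOMDocument;
$text = 'Tom & Jerry'; // sample text, chosen for illustration

// Old approach: the value is passed straight through, so the bare "&"
// is treated as the start of an entity reference and triggers a warning.
$bad = @$doc->createElement('h2', $text);

// New approach: create the element empty, then append a text node.
// createTextNode() treats the string as literal text and escapes it on output.
$good = $doc->createElement('h2');
$good->appendChild($doc->createTextNode($text));

echo $doc->saveXML($good), "\n"; // <h2>Tom &amp; Jerry</h2>
```

Escaping the value yourself with htmlspecialchars() before passing it to createElement is another option, but the text-node approach avoids having to remember that step.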