php - Using PHP preg_replace to prepend to src values no matter how badly formatted the img elements are

Question

My HTML content looks like this:

<div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/>

It is one long line, with no line breaks separating the img elements and no indentation.

The PHP code I am using is as follows:

/**
 *
 * Take in html content as string, find all the elements selected by $tag
 * (<link href="...">, <script src="..."> or <img src="...">)
 * and add $prepend to the attribute values except when they start with http: or https:
 *
 * @param $html String The html content
 * @param $prepend String The prepend we expect in front of the href/src values
 * @param $tag String Which elements to process: 'css', 'js' or 'img'
 * @return String The new $html content after find and replace. 
 * 
 */
    protected static function _prependAttrForTags($html, $prepend, $tag) {
        if ($tag == 'css') {
            $element = 'link';
            $attr = 'href';
        }
        else if ($tag == 'js') {
            $element = 'script';
            $attr = 'src';
        }
        else if ($tag == 'img') {
            $element = 'img';
            $attr = 'src';
        }
        else {
            // wrong tag so return unchanged
            return $html;
        }
        // this checks for all the "yada.*"
        $html = preg_replace('/(<'.$element.'\b.+'.$attr.'=")(?!http)([^"]*)(".*>)/', '$1'.$prepend.'$2$3$4', $html);
        // this checks for all the 'yada.*'
        $html = preg_replace('/(<'.$element.'\b.+'.$attr.'='."'".')(?!http)([^"]*)('."'".'.*>)/', '$1'.$prepend.'$2$3$4', $html);
        return $html;
    }
}

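I want my function to work no matter how badly formed the img elements are.

It must work regardless of where the src attribute sits inside the tag.

The only thing it should do is prepend something to the src value.

Also note that the preg_replace must not happen if the src value starts with http.

Right now, my code only works if my content is: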

<div class="preload">
    <img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"></img>
    <img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u15_line.png" width="1" height="1"/>

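As you may have guessed, it succeeds, but only for the first img element, because that one is followed by a line break and its opening img tag does not end with /.

Please advise how to improve my function.

Update:

I used DOMDocument and it worked! After prepending to the src value, I now need to replace the element with a PHP code snippet.

So the original: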

<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/>

After using DOMDocument and adding my prepend string:

<img src="prepended/PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1" />

Now I need to replace the whole element with the following:

<?php echo $this->Html->img('prepended/PRODUCTPAGE_files/read_icon_u12_normal.png', array('width'=>'1', 'height'=>'1')); ?>

Can I still use DOMDocument? Or do I need to use preg_replace?

score 1 · Accepted Answer

DOMDocument is built to parse HTML no matter how messy it is. Rather than building your own HTML parser, why not use it?

Combining DOMDocument with XPath, you can do this:

<?php
$html = <<<HTML
<script src="test"/><link href="test"/><div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img width="1" height="1" src="httpPRODUCTPAGE_files/line_u14_line.png"/>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$searchTags = $xpath->query('//img | //link | //script');

$length = $searchTags->length;
for ($i = 0; $i < $length; $i++) {
    $element = $searchTags->item($i);

    if ($element->tagName == 'link')
        $attr = 'href';
    else
        $attr = 'src';

    $src = $element->getAttribute($attr);
    if (!startsWith($src, 'http'))
    {
        $element->setAttribute($attr, "whatever" . $src);
    }
}

// this small function will check the start of a string 
// with a given term, in your case http or http://
function startsWith($haystack, $needle)
{
    return !strncmp($haystack, $needle, strlen($needle));
}

$result = $doc->saveHTML();
echo $result;

Here is a live demo of it working.
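To connect this back to the question, here is a minimal sketch of the same DOMDocument/XPath loop wrapped in a function shaped like the asker's _prependAttrForTags(). The standalone function name prependAttrForTags and the sample markup are assumptions made for illustration. Unlike the loop above, it only visits elements that actually carry the attribute.

```php
<?php
// Sketch only: the function name and the sample input are illustrative.
function prependAttrForTags($html, $prepend, $tag)
{
    // Same tag-to-element/attribute mapping as in the question.
    $map = array(
        'css' => array('link', 'href'),
        'js'  => array('script', 'src'),
        'img' => array('img', 'src'),
    );
    if (!isset($map[$tag])) {
        return $html; // unknown tag: return the input unchanged
    }
    list($element, $attr) = $map[$tag];

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // "@" hides warnings about malformed markup
    $xpath = new DOMXPath($doc);

    // Select only elements that have the attribute, wherever it sits in the tag.
    foreach ($xpath->query('//' . $element . '[@' . $attr . ']') as $node) {
        $value = $node->getAttribute($attr);
        if (strncmp($value, 'http', 4) !== 0) {
            $node->setAttribute($attr, $prepend . $value);
        }
    }
    return $doc->saveHTML();
}

$out = prependAttrForTags(
    '<div class="preload"><img src="a.png" width="1"/><img width="1" src="http://example.com/b.png"/>',
    'prepended/',
    'img'
);
echo $out;
```

Because the attribute is read through the DOM, its position inside the tag and the presence or absence of a trailing / no longer matter.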

If your HTML is messed up, for example because closing tags are missing, you can set the following before @$doc->loadHTML($html);:

$doc->recover = true;
$doc->strictErrorChecking = false;

If you want the output formatted, you can set the following before @$doc->loadHTML($html);:

$doc->formatOutput = true;
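The examples here silence parser warnings with the @ operator. If you would rather see what the parser complained about, one option is libxml's internal error buffer. The following is a minimal sketch; the sample markup is an assumption, and whether any warnings are reported at all depends on the input and the libxml version.

```php
<?php
// Sketch: collect libxml parse warnings instead of hiding them with "@".
$html = '<div class="preload"><img src="a.png"><p>unclosed';

$previous = libxml_use_internal_errors(true); // buffer errors, remember the old setting
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous behaviour

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable even though the markup was incomplete.
$imgCount = $doc->getElementsByTagName('img')->length;
echo $imgCount, "\n";
```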

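With XPath we select only the elements you need to edit, so we don't have to worry about the rest of the document.

Keep in mind that if your HTML is missing tags such as body, html, doctype, or head, they will be added automatically; if you already have them, nothing else should change.

However, if you want to remove them, you can use the following instead of just $doc->saveHTML();: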

$result = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $doc->saveHTML());
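As an alternative to stripping the wrapper tags with a regular expression, you could serialize only the children of the body element, since saveHTML() accepts a node argument. This is a sketch under the assumption that everything you care about ends up inside body; elements the parser moves into head would be left out, so it is not a drop-in replacement for every input.

```php
<?php
// Sketch: output only what is inside <body>, skipping the doctype/html/body
// wrappers that loadHTML() adds. The sample markup is illustrative.
$html = '<div class="preload"><img src="a.png" width="1" height="1"/></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html);

$result = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child); // serialize one node at a time
    }
}
echo $result;
```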

If you want to replace the element with a newly created one, you can use this:

$newElement = $doc->createElement($element->tagName, '');
$newElement->setAttribute($attr, "prepended/" . $src);
$myArrayWithAttributes = array ('width' => '1', 'height' => '1');
foreach ($myArrayWithAttributes as $attribute=>$value)
    $newElement->setAttribute($attribute, $value);
$element->parentNode->replaceChild($newElement, $element);

Or by creating a fragment:

$frag = $doc->createDocumentFragment();
$frag->appendXML('<?php echo $this->Html->img("prepended/PRODUCTPAGE_files/read_icon_u12_normal.png", array("width"=>"1", "height"=>"1")); ?>');
$element->parentNode->replaceChild($frag, $element);

Live demo.
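The fragment above hard-codes the src and the attributes. If you need the snippet generated from each element, a sketch along these lines could build the string from the element's own attributes. imgToHelperCall() is a hypothetical helper name, and $this->Html->img() is simply the call from the question: it is written out as text here, never executed.

```php
<?php
// Sketch: turn an <img> element into the PHP snippet shown in the question.
// imgToHelperCall() is a hypothetical name introduced for this example.
function imgToHelperCall(DOMElement $img, $prepend)
{
    $options = array();
    foreach ($img->attributes as $attribute) {
        if ($attribute->name !== 'src') {
            // var_export() quotes and escapes the strings as PHP literals.
            $options[] = var_export($attribute->name, true)
                . '=>' . var_export($attribute->value, true);
        }
    }
    $src = var_export($prepend . $img->getAttribute('src'), true);

    return '<?php echo $this->Html->img(' . $src
        . ', array(' . implode(', ', $options) . ')); ?>';
}

$doc = new DOMDocument();
@$doc->loadHTML('<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/>');
$img = $doc->getElementsByTagName('img')->item(0);

$snippet = imgToHelperCall($img, 'prepended/');
echo $snippet, "\n";
```

The resulting string can then be passed to appendXML() as in the fragment example. This sketch always prepends; add the same http check as in the main loop if absolute URLs should be left alone.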

You can use tidy to format the HTML:

$tidy = tidy_parse_string($result, array(
    'indent' => TRUE,
    'output-xhtml' => TRUE,
    'indent-spaces' => 4
));
$tidy->cleanRepair();
echo $tidy;
