php - 在不拆分单词或破坏 html 标签的情况下缩短文本

Question

我试图在不将单词切成两半并保留 html 标签的情况下截断 236 个字符后的文本。这就是我现在正在使用的：

$shortdesc = $_helper->productAttribute($_product, $_product->getShortDescription(), 'short_description');
$lenght = 236;
echo substr($shortdesc, 0, strrpos(substr($shortdesc, 0, $lenght), " "));

虽然这在大多数情况下都有效，但它不会尊重 html 标签。例如这段文字：

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>

将在标签仍然打开的情况下被切断。有没有办法在 236 个字符后截断文本但尊重 html 标签？

score 19 · Accepted Answer

我遇到的最佳解决方案来自 CakePHP 框架 TextHelper 类

这是方法

/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string  $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
    $default = array(
        'ending' => '...', 'exact' => true, 'html' => false
    );
    $options = array_merge($default, $options);
    extract($options);

    if ($html) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen(strip_tags($ending));
        $openTags = array();
        $truncate = '';

        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }
    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($html) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }
    $truncate .= $ending;

    if ($html) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

其他框架可能对此问题有类似（或不同）的解决方案，因此您也可以查看它们。我对 Cake 的熟悉促使我链接到他们的解决方案

编辑：

刚刚在我正在使用 OP 文本的应用程序中测试了这种方法

<?php 
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>', 
236, 
array('html' => true, 'ending' => '')); 
?>

输出：

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubegre</strong>

请注意，输出在完成最后一个单词之前停止，但包含完整的强标签

score 15 · Accepted Answer

这应该这样做：

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = array();

    public static function trim($html, $maxLen = 25)
    {

        $dom = new DomDocument();

        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            $dom->loadHTML($html);
        } else {
            $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        }

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        // remove wrapper tags added by DD (doctype, html...)
        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            // http://stackoverflow.com/a/6953808/1058140
            $dom->removeChild($dom->firstChild);
            $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

            return $dom->saveHTML();
        }

        return $dom->saveHTML();
    }

    protected function walk(DomNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof DomText) {
                $this->totalLen += $nodeLen = strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    $node->nodeValue = substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

像这样使用：$str = Html::trim($str, 236);

（在这里演示）

这和 cakePHP 的正则表达式解决方案之间的一些性能比较

在此处输入图像描述

几乎没有区别，而且在非常大的字符串大小下，DomDocument 实际上更快。在我看来，可靠性比节省几微秒更重要。

score 1 · Accepted Answer

我可以考虑一下吗？

示例文本：

Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong> hello

首先，将其解析为：

array(
    '0' => array(
        'tag' => '',
        'text' => 'Lorem ipsum dolor sit amet, '
    ),
    '1' => array(
        'tag' => '<i class="red">',
        'text' => 'magna aliquyam erat',
    )
    '2' => ......
    '3' => ......
)

然后将文本逐个剪切，并在剪切后用标签包裹每个文本，

然后加入他们。

score 0 · Accepted Answer

这将适用于 Unicode（来自@nice ass answer）：

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = [];

    public static function trim($html, $maxLen = 25)
    {

        $dom = new \DOMDocument();
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        return $dom->saveHTML();
    }

    protected function walk(\DOMNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof \DOMText) {
                $this->totalLen += $nodeLen = mb_strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    dump($node->nodeValue);
                    $node->nodeValue = mb_substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

score -1 · Accepted Answer

function limitStrlen($input, $length, $ellipses = true, $strip_html = true, $skip_html) 
{
    // strip tags, if desired
    if ($strip_html || !$skip_html) 
    {
        $input = strip_tags($input);

        // no need to trim, already shorter than trim length
        if (strlen($input) <= $length) 
        {
            return $input;
        }

        //find last space within length
        $last_space = strrpos(substr($input, 0, $length), ' ');
        if($last_space !== false) 
        {
            $trimmed_text = substr($input, 0, $last_space);
        } 
        else 
        {
            $trimmed_text = substr($input, 0, $length);
        }
    } 
    else 
    {
        if (strlen(strip_tags($input)) <= $length) 
        {
            return $input;
        }

        $trimmed_text = $input;

        $last_space = $length + 1;

        while(true)
        {
            $last_space = strrpos($trimmed_text, ' ');

            if($last_space !== false) 
            {
                $trimmed_text = substr($trimmed_text, 0, $last_space);

                if (strlen(strip_tags($trimmed_text)) <= $length) 
                {
                    break;
                }
            } 
            else 
            {
                $trimmed_text = substr($trimmed_text, 0, $length);
                break;
            }
        }

        // close unclosed tags.
        $doc = new DOMDocument();
        $doc->loadHTML($trimmed_text);
        $trimmed_text = $doc->saveHTML();
    }

    // add ellipses (...)
    if ($ellipses) 
    {
        $trimmed_text .= '...';
    }

    return $trimmed_text;
}

$str = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dolor</p> sit amet, consetetur.</h1>";

// view the HTML
echo htmlentities(limitStrlen($str, 22, false, false, true), ENT_COMPAT, 'UTF-8');

// view the result
echo limitStrlen($str, 22, false, false, true);

注意：可能有更好的方法来关闭标签，而不是使用DOMDocument. 例如，我们可以在 a 中使用 a p tag，h1 tag它仍然可以工作。但在这种情况下，标题标签将在之前关闭，p tag因为理论上不可能在p tag其中使用。因此，请注意 HTML 的严格标准。

score -2 · Accepted Answer

这是 JS 解决方案：trim-html

这个想法是以这种方式拆分 HTML 字符串，使数组的元素为 html 标签（打开或关闭）或只是字符串。

var arr = html.replace(/</g, "\n<")
              .replace(/>/g, ">\n")
              .replace(/\n\n/g, "\n")
              .replace(/^\n/g, "")
              .replace(/\n$/g, "")
              .split("\n");

比我们可以遍历数组和计数字符。

score -2 · Accepted Answer

我在 JS 中做过，希望这个逻辑对 PHP 也有帮助..

splitText : function(content, count){
        var originalContent = content;
         content = content.substring(0, count);
          //If there is no occurance of matches before breaking point and the hit breakes in between html tags.
         if (content.lastIndexOf("<") > content.lastIndexOf(">")){
            content = content.substring(0, content.lastIndexOf('<'));
            count = content.length;
            if(originalContent.indexOf("</", count)!=-1){
                content += originalContent.substring(count, originalContent.indexOf('>', originalContent.indexOf("</", count))+1);
            }else{
                 content += originalContent.substring(count, originalContent.indexOf('>', count)+1);
            }
          //If the breaking point is in between tags.
         }else if(content.lastIndexOf("<") != content.lastIndexOf("</")){
            content = originalContent.substring(0, originalContent.indexOf('>', count)+1);
         }
        return content;
    },

希望这个逻辑可以帮助一些人..

score -2 · Accepted Answer

您可以采用 XML 方法并将元素推送到字符串 var，直到字符串的长度超过 236

示例代码？

for each node // text or tag
  push to the string var

  if string length > 236
    break

endfor

用于在 PHP http://simplehtmldom.sourceforge.net/中解析 HTML

php - 在不拆分单词或破坏 html 标签的情况下缩短文本

8 回答 8

这和 cakePHP 的正则表达式解决方案之间的一些性能比较

Related

Reference