4

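With PHP's regular substr() function you can decide where to start cutting the string as well as set a length. The length is probably what gets used most, but in this case I need to cut roughly 120 characters off the beginning. The problem is that I need to keep the HTML in the string intact and cut only the actual text inside the tags.

I have found a few custom functions for this, but none that lets you set the starting point, i.e. where in the string you want to start cutting.

Here is one I found: Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

So I basically need a function that works exactly like the original substr(), except that it keeps the formatting.

Any suggestions?

Sample content to modify: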

<p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

After cutting 5 characters from the beginning:

<p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

With 5 characters cut from both the beginning and the end:

<p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.1</p>

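So, you see what I mean?

If the cut would land in the middle of a word, I would prefer it to cut off the whole word, but that is not critical.

**Edit:** Fixed the quotes.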

3 Answers

2

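What you are asking for involves a lot of complexity (essentially, producing a valid HTML subset given string offsets). It would be better to rephrase the problem in terms of the number of text characters you want to keep, rather than as cutting an arbitrary string that happens to contain HTML. Once you do that, the problem becomes much easier, because you can use a real HTML parser. You no longer need to worry about:

  • Accidentally cutting an element in half.
  • Accidentally cutting an entity in half.
  • Not counting the text inside elements.
  • Making sure character entities count as a single character.
  • Making sure all elements are properly closed.
  • Making sure you don't clobber the string because you used substr() on a UTF-8 string.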
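A few of the pitfalls in the list above can be reproduced in a handful of lines. This is only an illustrative sketch: the sample strings are made up for the demonstration and do not come from the question.

```php
<?php
// Illustrative strings only; not taken from the question's sample content.
$plain = "caf\xC3\xA9 & co";               // "café & co": 9 characters, 10 bytes
$html  = '<b>caf&eacute;</b> &amp; co';    // the same text written as HTML

// substr() counts bytes, so on UTF-8 input it can split "é" (2 bytes) in half.
$bytes = substr($plain, 0, 4);             // "caf" plus half a character
$chars = mb_substr($plain, 0, 4, 'UTF-8'); // "café"

// Cutting the raw markup slices straight through an entity.
$broken = substr($html, 0, 8);             // "<b>caf&e"

// Decoding entities first makes each one count as a single character.
$decoded = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

var_dump(mb_check_encoding($bytes, 'UTF-8')); // the byte-level cut is not valid UTF-8
var_dump($broken);
var_dump(mb_strlen($decoded, 'UTF-8'));
```

A parser-based approach avoids all three problems because it works on decoded text nodes rather than on the raw bytes of the markup.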

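It is possible to do this with a regular expression (with the u flag), mb_substr(), and a tag stack (I have done it before), but there are a lot of edge cases and you will generally have a hard time.

The DOM solution, however, is fairly simple: walk all text nodes, count their string lengths, and remove or substring their text content as needed. The code below does this: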

$html = <<<'EOT'
<p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>
EOT;
function substr_html($html, $start, $length=null, $removeemptyelements=true) {
    if (is_int($length)) {
        if ($length===0) return '';
        $end = $start + $length;
    } else {
        $end = null;
    }
    $d = new DOMDocument();
    $d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title></title></head><body>'.$html.'</body>');
    $body = $d->getElementsByTagName('body')->item(0);
    $dxp = new DOMXPath($d);
    $t_start = 0; // text node's start pos relative to all text
    $t_end   = null; // text node's end pos relative to all text

    // copy because we may modify result of $textnodes
    $textnodes = iterator_to_array($dxp->query('/descendant::*/text()', $body));

// PHP 5.2 doesn't seem to implement Traversable on DOMNodeList,
// so `iterator_to_array()` won't work. Use this instead:
// $textnodelist = $dxp->query('/descendant::*/text()', $body);
// $textnodes = array();
// for ($i = 0; $i < $textnodelist->length; $i++) {
//  $textnodes[] = $textnodelist->item($i);
//}
//unset($textnodelist);

    foreach($textnodes as $text) {
        $t_end = $t_start + $text->length;
        $parent = $text->parentNode;
        if ($start >= $t_end || ($end!==null && $end < $t_start)) {
            $parent->removeChild($text);
        } else {
            $n_offset = max($start - $t_start, 0);
            // substringData() counts from $n_offset, so subtract it from the length
            $n_length = ($end===null) ? $text->length - $n_offset : $end - $t_start - $n_offset;
            if (!($n_offset===0 && $n_length >= $text->length)) {
                $substr = $text->substringData($n_offset, $n_length);
                if (strlen($substr)) {
                    $text->deleteData(0, $text->length);
                    $text->appendData($substr);
                } else {
                    $parent->removeChild($text);
                }
            }
        }

        // if removing this text emptied the parent of nodes, remove the node!
        if ($removeemptyelements && !$parent->hasChildNodes()) {
            $parent->parentNode->removeChild($parent);
        }

        $t_start = $t_end;
    }
    unset($textnodes);
    $newstr = $d->saveHTML($body);

    // mb_substr() is to remove <body></body> tags
    return mb_substr($newstr, 6, -7, 'utf-8');
}


echo substr_html($html, 480, 30);

This will output:

<p> of "de Finibus</p> <p>Bonorum et Mal</p>

Note that it is not confused by the fact that your "substring" spans multiple p elements.
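The question's second example (dropping characters from both the beginning and the end) can be handled with the same text-node walk. The following is a minimal sketch under stated assumptions: `trim_html_text()` and its parameters are hypothetical names, not part of the answer's code, and empty elements left behind by the cut are not removed.

```php
<?php
// Hypothetical helper: drops $fromStart text characters from the beginning
// and $fromEnd from the end, leaving the markup itself alone.
function trim_html_text($html, $fromStart, $fromEnd = 0) {
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $d = new DOMDocument();
    // The meta tag tells libxml to read the fragment as UTF-8.
    $d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"></head><body>'.$html.'</body></html>');
    $body = $d->getElementsByTagName('body')->item(0);
    $xp = new DOMXPath($d);
    // Copy the node list because the nodes are modified while looping.
    $nodes = iterator_to_array($xp->query('.//text()', $body));

    // Total text length, needed to turn $fromEnd into an absolute end offset.
    $total = 0;
    foreach ($nodes as $n) {
        $total += $n->length;
    }
    $end = $total - $fromEnd;

    $pos = 0; // offset of the current text node within all text
    foreach ($nodes as $n) {
        $len  = $n->length;
        $from = min(max($fromStart - $pos, 0), $len); // first character to keep
        $to   = max(min($end - $pos, $len), $from);   // one past the last kept character
        // Delete the tail first so that $from is still a valid offset afterwards.
        if ($to < $len) {
            $n->deleteData($to, $len - $to);
        }
        if ($from > 0) {
            $n->deleteData(0, $from);
        }
        $pos += $len;
    }

    // Serialize the children of <body> so the wrapper tags are not included.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $d->saveHTML($child);
    }
    return $out;
}

$out = trim_html_text('<p>Contrary to <a href="#">popular</a> belief.</p>', 5, 5);
echo strip_tags($out); // the remaining text: "ary to popular be"
```

Passing a node to saveHTML() requires PHP 5.3.6 or later, the same constraint as the function above.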

answered 2013-01-04T00:31:52.137
1

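Here is a start, using DOMDocument (an XML/HTML parser), RecursiveIteratorIterator (for easy traversal of recursive structures) and a custom DOMNodeList iterator implementation that works with RecursiveIteratorIterator.

It is still sloppy (it does not return a copy, but acts on a reference to the DOMNode/DOMDocument), and it lacks the fancy features of the regular substr(), such as negative values for $start and/or $length, but it seems to do the job so far. I'm sure there are bugs. Still, it should give you an idea of how to approach this with DOMDocument.

The custom iterators: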

class DOMNodeListIterator
    implements Iterator
{
    protected $domNodeList;

    protected $position;

    public function __construct( DOMNodeList $domNodeList )
    {
        $this->domNodeList = $domNodeList;
        $this->rewind();
    }

    public function valid()
    {
        return $this->position < $this->domNodeList->length;
    }

    public function next()
    {
        $this->position++;
    }

    public function key()
    {
        return $this->position;
    }

    public function rewind()
    {
        $this->position = 0;
    }

    public function current()
    {
        return $this->domNodeList->item( $this->position );
    }
}

class RecursiveDOMNodeListIterator
    extends DOMNodeListIterator
    implements RecursiveIterator
{
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    public function getChildren()
    {
        return new self( $this->current()->childNodes );
    }
}

The actual function:

function DOMSubstr( DOMNode $domNode, $start = 0, $length = null )
{
    if( $start == 0 && ( $length === null || $length >= strlen( $domNode->nodeValue ) ) )
    {
        return;
    }

    $nodesToRemove = array();
    $rii = new RecursiveIteratorIterator( new RecursiveDOMNodeListIterator( $domNode->childNodes ), RecursiveIteratorIterator::SELF_FIRST );
    foreach( $rii as $node )
    {
        if( $start <= 0 && $length !== null && $length <= 0 )
        {
            /* can't remove immediately
             * because this will mess with
             * iterating over RecursiveIteratorIterator
             * so remember for removal, later on
             */
            $nodesToRemove[] = $node;
            continue;
        }

        if( $node->nodeType == XML_TEXT_NODE )
        {
            if( $start > 0 )
            {
                $count = min( $node->length, $start );
                $node->deleteData( 0, $count );
                $start -= $count;
            }

            if( $start <= 0 )
            {
                if( $length === null )
                {
                    break;
                }
                else if( $length <= 0 )
                {
                    continue;
                }
                else if( $length >= $node->length )
                {
                    $length -= $node->length;
                    continue;
                }
                else
                {
                    $node->deleteData( $length, $node->length - $length );
                    $length = 0;
                }
            }
        }
    }

    foreach( $nodesToRemove as $node )
    {
        $node->parentNode->removeChild( $node );
    }
}

Usage:

$html = <<<HTML
<p>Just a short text sample with <a href="#">a link</a> and some trailing elements such as <strong>strong text</strong>, <em>emphasized text</em>, <del>deleted text</del> and <ins>inserted text</ins></p>
HTML;

$dom = new DomDocument();
$dom->loadHTML( $html );
/*
 * this is particularly sloppy:
 * I pass $dom->firstChild->nextSibling->firstChild (i.e. <body>)
 * because the function uses strlen( $domNode->nodeValue )
 * which will be 0 for DOMDocument itself
 * and I didn't want to utilize DOMXPath in the function
 * but perhaps I should have
 */
DOMSubstr( $dom->firstChild->nextSibling->firstChild, 8, 25 );

/*
 * passing a specific node to DOMDocument::saveHTML()
 * only works with PHP >= 5.3.6
 */
echo $dom->saveHTML( $dom->firstChild->nextSibling->firstChild->firstChild );
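The negative `$start` and `$length` values that this answer says it does not support could be layered on top by converting them to absolute offsets first. The following is a sketch only: `normalize_substr_range()` is a hypothetical helper, and it assumes the total number of text characters is already known (for example from `mb_strlen()` of the node's text).

```php
<?php
// Hypothetical helper: converts substr()-style arguments into absolute
// [start, end) offsets for a text of $total characters.
function normalize_substr_range($total, $start, $length = null) {
    if ($start < 0) {
        $start = max($total + $start, 0);  // negative start counts from the end
    }
    $start = min($start, $total);

    if ($length === null) {
        $end = $total;                     // no length: run to the end
    } elseif ($length < 0) {
        $end = $total + $length;           // negative length: stop that far before the end
    } else {
        $end = $start + $length;
    }
    $end = max(min($end, $total), $start); // clamp so the range is never inverted

    return array($start, $end);
}

list($s, $e) = normalize_substr_range(10, -3);    // last three characters
echo "$s-$e\n";                                   // 7-10
list($s, $e) = normalize_substr_range(10, 5, -2); // from 5, stopping 2 before the end
echo "$s-$e\n";                                   // 5-8
```

The resulting pair could then be handed to a function that only understands non-negative values, as `$start` and `$end - $start`.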
answered 2013-01-03T16:59:08.780
0

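If the text is not too long (because of the runtime), you could try this.

but in this case I need to cut roughly 120 characters off the beginning.

It does exactly that. Enter your text or grab it from somewhere, then enter the number of characters to remove from the beginning.

Please don't be too hard on this: it is a solution for short strings and not the best way to do it, but it is a complete working code example!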

<?php
$text = "<a href='blablabla'>m</a>ylinks...<b>not this code is working</b>......";
$newtext = "";
$delete = 13;
$tagopen = false;

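while ($text != "") {
    // take the next character off the front of the string
    $checktag = $text[0];
    $text = substr($text, 1);
    if ($checktag == "<" || $tagopen == TRUE) {
        // inside a tag: always copy, and track whether the tag has closed
        $newtext .= $checktag;
        if ($checktag == ">") {
            $tagopen = FALSE;
        } else {
            $tagopen = TRUE;
        }
    } elseif ($delete > 0) {
        // visible text that still has to be dropped
        $delete = $delete - 1;
    } else {
        $newtext .= $checktag;
    }
}
echo $newtext;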



?>

It returns:

<a href='blablabla'></a><b> this code is working</b>......
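The runtime concern comes from calling substr() on the remaining string at every step. A single pass by index avoids that. This is a sketch with the same logic wrapped in a function; `cut_leading_text()` is a hypothetical name, and like the loop above it counts bytes rather than characters and does not handle a ">" inside an attribute value.

```php
<?php
// Hypothetical rewrite of the loop above: same rules, but the string is
// walked by index instead of being re-sliced on every iteration.
function cut_leading_text($text, $delete) {
    $out = '';
    $inTag = false;
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if ($inTag || $ch === '<') {
            $out .= $ch;             // markup is always copied through
            $inTag = ($ch !== '>');  // a ">" closes the current tag
        } elseif ($delete > 0) {
            $delete--;               // drop one byte of visible text
        } else {
            $out .= $ch;
        }
    }
    return $out;
}

$result = cut_leading_text("<a href='blablabla'>m</a>ylinks...<b>not this code is working</b>......", 13);
echo $result; // <a href='blablabla'></a><b> this code is working</b>......
```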
answered 2013-01-03T14:53:14.673