php - 将多字节字符串截断为 n 个字符

Question

我正在尝试在字符串过滤器中使用此方法：

public function truncate($string, $chars = 50, $terminator = ' …');

我期待这个

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

还有这个

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …";

那是$chars减去字符串的字符$terminator。

此外，过滤器应该在低于$chars限制的第一个单词边界处切割，例如

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

我很确定这应该适用于这些步骤

从最大字符中减去终止符中的字符数
验证字符串是否比计算的限制长或原样返回
找到低于计算限制的字符串中的最后一个空格字符以获取单词边界
如果没有找到最后一个空格，则在最后一个空格处剪切字符串或计算限制
将终止符附加到字符串
返回字符串

但是，我现在尝试了各种str*和mb_*功能的组合，但都产生了错误的结果。这不可能那么难，所以我显然错过了一些东西。有人会为此分享一个有效的实现，或者将我指向一个我最终可以理解如何做到这一点的资源。

谢谢

PS 是的，我之前检查过https://stackoverflow.com/search?q=truncate+string+php :)

score 5 · Accepted Answer

刚刚发现 PHP 已经有一个多字节截断

mb_strimwidth— 获取指定宽度的截断字符串

但它不遵守单词边界。不过还是很方便的！

score 3 · Accepted Answer

尝试这个：

function truncate($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

但是您需要确保正确设置了内部编码。

score 0 · Accepted Answer

我通常不喜欢为这样的问题编写完整的答案。但我也刚醒来，我想也许你的问题会让我心情愉快地在一天的剩余时间里去编程。

我没有尝试运行它，但它应该可以工作，或者至少可以让你完成 90% 的工作。

function truncate( $string, $chars = 50, $terminate = ' ...' )
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

score 0 · Accepted Answer

tldr;

足够短的字符串不应附加省略号。
换行符也应该是限定断点。
正则表达式，一旦被分解和解释，并不太可怕。

我认为关于这个问题和当前的一系列答案，有一些重要的事情需要指出。我将根据 Gordon 的示例数据和一些其他案例来演示答案的比较以及我的正则表达式答案，以揭示一些不同的结果。

首先，要明确输入值的质量。Gordon 说该函数需要是多字节安全的并尊重字边界。在确定截断位置时，示例数据没有公开对非空格、非单词字符（例如标点符号）的期望处理，因此我们必须假设定位空白字符就足够了——而且明智地如此，因为大多数“阅读更多”字符串在截断时不会担心尊重标点符号。

其次，在相当常见的情况下，需要对包含换行符的大量文本应用省略号。

第三，让我们随意同意一些基本的数据标准化，例如：

字符串已被修剪掉所有前导/尾随空白字符
的值$chars总是大于mb_strlen()的$terminator

（演示）

功能：

function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}

测试用例：

$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …&quot;,
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …&quot;,
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …&quot;,
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "1234567890123456789012345678901234567890123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …&quot;,
    ],
];

执行：

foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}

输出：

    Sample Input:           Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:          Answer to the Ultimate Question of Life, the …
    truncateGordon:         Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:        Answer to the Ultimate Question of Life, the …
    truncateMickmackusa:    Answer to the Ultimate Question of Life, the …
    Expected Result:        Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
    Sample Input:           A single line of text to be followed by another
line of text

    truncateGumbo:          A single line of text to be followed by …
    truncateGordon:         A single line of text to be followed by another
 …
    truncateSoapBox:        A single line of text to be followed by …
    truncateMickmackusa:    A single line of text to be followed by another …
    Expected Result:        A single line of text to be followed by another …
-----------------------------------------------------
    Sample Input:           âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:          âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateGordon:         âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa:    âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    Expected Result:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
    Sample Input:           123456789 123456789 123456789 123456789 123456789

    truncateGumbo:          123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:         123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:        123456789 123456789 123456789 123456789 …
    truncateMickmackusa:    123456789 123456789 123456789 123456789 123456789
    Expected Result:        123456789 123456789 123456789 123456789 123456789
-----------------------------------------------------
    Sample Input:           Hello worldly world

    truncateGumbo:          
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
    truncateGordon:         Hello worldly world
    truncateSoapBox:        Hello worldly …
    truncateMickmackusa:    Hello worldly world
    Expected Result:        Hello worldly world
-----------------------------------------------------
    Sample Input:           abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:          abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:         abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa:    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------

我的模式解释：

尽管看起来确实很不雅观，但大多数乱码模式语法都是将数值作为动态量词插入的问题。

我也可以写成：

'~(?:\S{' . $trunc . '}|(?=.{' . $max . '}).{0,' . $trunc . '}(?=\s))\K.+~us'

为简单起见，我将替换$trunc为48和。$max50

~                 #opening pattern delimiter
(?=.{50})         #lookahead to ensure that the string has a minimum of 50 characters
(?:               #start of non-capturing group -- to maintain pattern logic only
  \S{48}          #the string starts with at least 48 non-white-space characters
  |               #or
  .{0,48}(?=\s)   #the string starts with upto 48 characters followed by a whitespace
)                 #end of non-capturing group
\K                #restart the fullstring match (aka "forget" the previously matched characters)
.+                #match the remaining characters (these characters will be replaced)
~                 #closing pattern delimiter
us                #pattern modifiers: unicode/multibyte flag & dot matches newlines flag

php - 将多字节字符串截断为 n 个字符

4 回答 4

Related

Reference