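tl;dr:
- Strings that are already short enough should not have an ellipsis appended.
- Newline characters should also qualify as breakpoints.
- Regex, once broken down and explained, is not too scary.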
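I think there are a few important things to point out about this question and the current batch of answers. I'll compare the existing answers with my regex answer, using Gordon's sample data plus a few additional cases, to expose where the results differ.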
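First, let's be clear about the quality of the input values. Gordon says the function needs to be multibyte safe and respect word boundaries. The sample data does not reveal how non-space, non-word characters (e.g. punctuation) should be treated when determining the truncation position, so we must assume that targeting whitespace characters is sufficient. That is a sensible assumption, because most "read more" strings don't worry about respecting punctuation when truncating.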
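To make the multibyte requirement concrete, here is a minimal sketch contrasting byte-based and character-based string functions. It assumes the source file is saved as UTF-8; the variable names are illustrative only.

```php
<?php
$text = 'âãäåæ'; // five characters, two bytes each in UTF-8

$byteLength = strlen($text);               // counts bytes
$charLength = mb_strlen($text, 'UTF-8');   // counts characters

$byteCut = substr($text, 0, 3);            // cuts the second character in half
$charCut = mb_substr($text, 0, 3, 'UTF-8'); // keeps three whole characters

echo $byteLength, ' bytes vs ', $charLength, " characters\n";
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // the byte-based cut is no longer valid UTF-8
echo $charCut, "\n";
```

This is why every candidate function below relies on mb_* functions or the u pattern modifier rather than plain strlen()/substr().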
Second, it is fairly common to need to apply an ellipsis to large bodies of text that contain newline characters.
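A small sketch of why this matters: searching for a literal space skips over a newline, whereas `\s` treats it as a boundary. The 48-character budget assumes a 50-character limit minus a 2-character terminator; the variable names are illustrative.

```php
<?php
// Sample string whose only whitespace near the limit is "\n".
$text = "A single line of text to be followed by another\nline of text";

// Inspect the first 49 characters: 48 to keep plus one potential boundary character.
$window = mb_substr($text, 0, 49, 'UTF-8');

// A literal-space search ignores the newline and lands on an earlier word.
$lastSpace = mb_strrpos($window, ' ', 0, 'UTF-8');

// \s matches spaces, tabs, and newlines; the greedy .* makes this find the last one.
$lastWhitespace = null;
if (preg_match('~.*\K\s~us', $window, $m, PREG_OFFSET_CAPTURE)) {
    // preg offsets are byte offsets, so convert to a character offset.
    $lastWhitespace = mb_strlen(substr($window, 0, $m[0][1]), 'UTF-8');
}

echo mb_substr($text, 0, $lastSpace, 'UTF-8'), "\n";      // A single line of text to be followed by
echo mb_substr($text, 0, $lastWhitespace, 'UTF-8'), "\n"; // A single line of text to be followed by another
```

The space-only search gives up the word "another" even though it fits within the budget.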
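Third, let's arbitrarily agree on some basic data normalization, namely:
- Strings have already been trimmed of all leading/trailing whitespace characters.
- The value of $chars is always greater than the mb_strlen() of $terminator.
(Demo)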
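If you would rather enforce these two assumptions than trust the caller, something along these lines could run before any of the truncation functions. The helper name `prepareTruncationInput` is hypothetical and not part of any answer compared here.

```php
<?php
// Hypothetical pre-check: enforces the two normalization assumptions listed above.
function prepareTruncationInput(string $string, int $chars, string $terminator): string
{
    // The limit must leave room for at least one character before the terminator.
    if ($chars <= mb_strlen($terminator, 'UTF-8')) {
        throw new InvalidArgumentException('$chars must be greater than the terminator length');
    }

    // trim() strips ASCII whitespace only; in UTF-8 those bytes never occur
    // inside a multibyte character, so this cannot corrupt the string.
    return trim($string);
}

echo prepareTruncationInput("  Hello worldly world \n", 50, ' …'), "\n";
```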
Functions:
function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}
Test Cases:
$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …",
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …",
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "123456789 123456789 123456789 123456789 123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
    ],
];
Execution:
foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}
Output:
Sample Input: Answer to the Ultimate Question of Life, the Universe, and Everything.
truncateGumbo: Answer to the Ultimate Question of Life, the …
truncateGordon: Answer to the Ultimate Question of Life, the Uni …
truncateSoapBox: Answer to the Ultimate Question of Life, the …
truncateMickmackusa: Answer to the Ultimate Question of Life, the …
Expected Result: Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
Sample Input: A single line of text to be followed by another
line of text
truncateGumbo: A single line of text to be followed by …
truncateGordon: A single line of text to be followed by another
…
truncateSoapBox: A single line of text to be followed by …
truncateMickmackusa: A single line of text to be followed by another …
Expected Result: A single line of text to be followed by another …
-----------------------------------------------------
Sample Input: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ
truncateGumbo: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
truncateGordon: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
truncateSoapBox: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
truncateMickmackusa: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
Expected Result: âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
Sample Input: 123456789 123456789 123456789 123456789 123456789
truncateGumbo: 123456789 123456789 123456789 123456789 12345678 …
truncateGordon: 123456789 123456789 123456789 123456789 123456789
truncateSoapBox: 123456789 123456789 123456789 123456789 …
truncateMickmackusa: 123456789 123456789 123456789 123456789 123456789
Expected Result: 123456789 123456789 123456789 123456789 123456789
-----------------------------------------------------
Sample Input: Hello worldly world
truncateGumbo:
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
truncateGordon: Hello worldly world
truncateSoapBox: Hello worldly …
truncateMickmackusa: Hello worldly world
Expected Result: Hello worldly world
-----------------------------------------------------
Sample Input: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890
truncateGumbo: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
truncateGordon: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
truncateSoapBox: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
truncateMickmackusa: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
Expected Result: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------
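Comparing columns of output by eye is error-prone, so a strict comparison against the 'expected' values may be easier to trust. The following is only a sketch: `findMismatches` is a hypothetical helper, and `$hardCut` is a deliberately naive stand-in truncator (not one of the answers above) used to show what a failure report looks like.

```php
<?php
// Hypothetical helper: lists every test where $truncate() disagrees with 'expected'.
function findMismatches(callable $truncate, array $tests): array
{
    $failures = [];
    foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
        $actual = $truncate($testCase);
        if ($actual !== $expected) {
            $failures[] = ['input' => $testCase, 'expected' => $expected, 'actual' => $actual];
        }
    }
    return $failures;
}

// Stand-in for demonstration: hard cut at 48 characters with no word-boundary logic.
$hardCut = function (string $s): string {
    return mb_strlen($s, 'UTF-8') > 50 ? mb_substr($s, 0, 48, 'UTF-8') . ' …' : $s;
};

$sampleTests = [
    ['testCase' => "Hello worldly world", 'expected' => "Hello worldly world"],
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
];

$failures = findMismatches($hardCut, $sampleTests);
echo count($failures), " mismatch(es)\n";
foreach ($failures as $failure) {
    echo "expected: {$failure['expected']}\n";
    echo "actual:   {$failure['actual']}\n";
}
```

Passing any of the four functions above as the callable (for example `'truncateMickmackusa'`) together with the full `$tests` array would give a per-function list of mismatches.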
My Pattern Explanation:
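Although it admittedly looks rather unsightly, most of the garbled pattern syntax is just a matter of inserting the numeric values as dynamic quantifiers.
I could also have written:
'~(?=.{' . $max . '})(?:\S{' . $trunc . '}|.{0,' . $trunc . '}(?=\s))\K.+~us'
For simplicity, I'll replace $trunc with 48 and $max with 50.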
~                 #opening pattern delimiter
(?=.{50})         #lookahead to ensure that the string has a minimum of 50 characters
(?:               #start of non-capturing group (maintains pattern logic only)
  \S{48}          #the string starts with 48 consecutive non-whitespace characters
  |               #or
  .{0,48}(?=\s)   #the string starts with up to 48 characters followed by a whitespace character
)                 #end of non-capturing group
\K                #restart the fullstring match (aka "forget" the previously matched characters)
.+                #match the remaining characters (these characters will be replaced)
~                 #closing pattern delimiter
us                #pattern modifiers: unicode/multibyte flag & dot matches newlines flag
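The breakdown above can also live inside the pattern itself by adding the x (extended) modifier, which ignores literal whitespace in the pattern and treats `#` as the start of a comment. This is a sketch of that alternative layout, not a change in logic; the function name `truncateCommented` is hypothetical, and sprintf() placeholders stand in for the interpolated quantifiers.

```php
<?php
// Hypothetical variant of truncateMickmackusa(); only the pattern layout differs.
function truncateCommented(string $string, int $max = 50, string $terminator = ' …'): string
{
    $trunc = $max - mb_strlen($terminator, 'UTF-8');

    // sprintf() fills %1$d with $max and %2$d with $trunc.
    $pattern = sprintf('~
        (?=.{%1$d})          # require at least $max characters
        (?:
            \S{%2$d}         # $trunc non-whitespace characters in a row
          |                  # or
            .{0,%2$d}(?=\s)  # up to $trunc characters followed by whitespace
        )
        \K                   # forget everything matched so far
        .+                   # match the remainder, which gets replaced
        ~usx', $max, $trunc);

    // preg_replace() returns null on failure (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, $terminator, $string) ?? $string;
}

echo truncateCommented("Answer to the Ultimate Question of Life, the Universe, and Everything."), "\n";
echo truncateCommented("Hello worldly world"), "\n";
```

One caveat with this layout: the comments must not contain the pattern delimiter (`~` here), because PHP looks for the closing delimiter before the regex engine ever sees the comments.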