php - PHP - Substring after X characters with special characters

Question

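Sorry for the title, I really didn't know how to phrase it...

I often have a string that I need to cut after X characters. My problem is that this string often contains special characters such as: &egrave;

So I'm wondering whether there is a way in PHP to know, without transforming my string, whether I am in the middle of a special character when I cut it.

Example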

This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact

Right now my substring result is:

This is my string with a special char : &egra

But I want something like this:

This is my string with a special char : &egrave;

score 7 · Accepted Answer

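The best thing to do is to store your strings as UTF-8 without any HTML entities, and use the mb_* family of functions with UTF-8 as the encoding.

However, if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mbstring library: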

$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');

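However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as HTML entities". Here is an example of how this can go wrong: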

// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === '&Atilde;&copy;'
// should be '&eacute; '

When your string is in a multibyte encoding, you must instead convert all HTML entities to a common encoding before you split. For example:

$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);

score 3 · Accepted Answer

You can first use html_entity_decode() to decode all the HTML entities, then split your string, then use htmlentities() to re-encode the entities.

$decoded_string = html_entity_decode($original_string);
// implement logic to split string here

// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
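
To make the placeholder above concrete, here is a minimal sketch of the full round trip. The function name truncate_entities is hypothetical, and the sketch assumes the decoded text can be treated as UTF-8 and that the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the answer): decode, cut by characters, re-encode.
// Assumes UTF-8 once decoded and that ext-mbstring is available.
function truncate_entities(string $text, int $length): string
{
    $flags = ENT_QUOTES | ENT_HTML401;

    // 1. Turn every entity into the single character it stands for.
    $decoded = html_entity_decode($text, $flags, 'UTF-8');

    // 2. Cut by characters, not bytes, so "è" counts as one.
    $cut = mb_substr($decoded, 0, $length, 'UTF-8');

    // 3. Re-encode so the result is entity-encoded again.
    return htmlentities($cut, $flags, 'UTF-8');
}

echo truncate_entities('special char : &egrave; - and more', 16);
// prints: special char : &egrave;
```

One caveat with this approach: re-encoding with htmlentities() also converts characters that were literal in the input (a raw quote, for instance), so the output is not guaranteed to be a byte-for-byte prefix of the original string.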

score 3 · Accepted Answer

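The longest HTML 4 named entity is 10 characters long, including the ampersand and semicolon (HTML5 defines longer ones). If you intend to cut the string at byte X, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.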

However, if you are willing to preprocess the string, Mike's solution will be more accurate, because his cuts the string at X characters rather than X bytes.
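
The byte-based check described in this answer could look roughly like the sketch below. The function name cut_outside_entity, the $maxEntity parameter and the entity-shape regex are my own assumptions for illustration; the default of 10 is simply the figure quoted in the answer.

```php
<?php
// Hypothetical helper: cut at byte $x, but if the cut lands inside an
// entity, extend it to just after the closing semicolon.
function cut_outside_entity(string $text, int $x, int $maxEntity = 10): string
{
    if (strlen($text) <= $x) {
        return $text; // nothing to cut
    }
    $head = substr($text, 0, $x);

    // Look for an ampersand in bytes X-9 .. X-1 (for $maxEntity = 10).
    $windowStart = max(0, $x - ($maxEntity - 1));
    $amp = strrpos(substr($head, $windowStart), '&');
    if ($amp === false) {
        return $head;
    }
    $amp += $windowStart;

    // First semicolon after that ampersand.
    $semi = strpos($text, ';', $amp);
    if ($semi === false || $semi < $x || $semi - $amp >= $maxEntity) {
        return $head; // already closed before the cut, or too long to be an entity
    }

    // Assumption: only extend the cut when the candidate looks like an entity.
    $candidate = substr($text, $amp, $semi - $amp + 1);
    if (!preg_match('/^&(#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);$/i', $candidate)) {
        return $head;
    }
    return substr($text, 0, $semi + 1);
}

$s = 'This is my string with a special char : &egrave; - and I want it to cut';
echo cut_outside_entity($s, 45);
// prints: This is my string with a special char : &egrave;
```

Since this works on bytes, it can still split a multibyte character if the string is UTF-8, which is the trade-off the answer points out.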

score 3 · Accepted Answer

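The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Failing that, if you don't mind the count being off (&egrave; counts as one character instead of 8), the following snippet should work: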

<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
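// Explicit flags instead of NULL (passing NULL for an int parameter is deprecated since PHP 8.1);
// ENT_NOQUOTES keeps the behavior NULL used to have. The encoding is also passed to mb_substr().
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_NOQUOTES, 'UTF-8')."<br><br>";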

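Note: If you use a different function to encode the text (e.g. htmlspecialchars()), then use that function instead of htmlentities(). If you use a custom function, then use another custom function that does the opposite of it instead of html_entity_decode() (and your custom function instead of htmlentities()).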

score 2 · Accepted Answer

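A small brute-force solution, which I'm not really happy with, would be a PCRE expression. Let's assume that you want to pass 80 characters and that the longest possible HTML entity is 7 characters long: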

$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return slightly shorter text
return preg_replace($regex, '$1', $text);

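Just so you know:

.{73} - 73 characters
[^&]{7} - okay, we can fill the rest with anything that doesn't contain &
.{0,7}$ - keep the possible end of the text in mind (this shouldn't be necessary, because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at the 79th), then & and let the entity finish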

Something that looks much better, but requires playing with the numbers a bit, is:

// If $text is no longer than $N bytes, there is nothing to cut
if (strlen($text) <= $N) {
    return $text;
}

// The plain cut: the first $N bytes
$head = substr($text, 0, $N);

// Get the last & before the cut
$pos = strrpos($head, '&');

// We're not young anymore, we have to check this too (no entities at all) :)
if ($pos === false) {
    return $head;
}

// Get the last ; before the cut
$end = strrpos($head, ';');

// false can't be compared with a position, so use -1 (no ; before the cut)
if ($end === false) {
    $end = -1;
}

// Okay, the entity is already closed (; comes after &)
if ($end > $pos) {
    return $head;
}

// Now we need to find the first ; at or after the cut
$end = strpos($text, ';', $N);
if ($end === false) {
    // Not valid HTML (unclosed entity): do whatever you want, here we fall back to the plain cut
    return $head;
}

// Include the closing ;
return substr($text, 0, $end + 1);

(Check the numbers, there may be a +/-1 somewhere in the indexes...)

score 0 · Accepted Answer

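I think you would have to use a combination of strpos and strrpos to find the next and previous whitespace, parse the text between the spaces, check it against a known list of special characters, and, if it matches, extend your "cut" to the position of the next space. If you post a sample of the code you have now, we could give you a better answer.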
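
A rough sketch of that idea follows. The function name cut_at_space_if_entity is hypothetical, and using get_html_translation_table() as the "known list of special characters" is my assumption, not something the answer specifies.

```php
<?php
// Hypothetical helper: if the word around the cut point contains a known
// entity, extend the cut to the next space; otherwise cut at byte $x.
function cut_at_space_if_entity(string $text, int $x): string
{
    if (strlen($text) <= $x) {
        return $text; // nothing to cut
    }

    // Previous space before the cut, next space at or after it.
    $prev  = strrpos(substr($text, 0, $x), ' ');
    $start = ($prev === false) ? 0 : $prev + 1;
    $next  = strpos($text, ' ', $x);
    $end   = ($next === false) ? strlen($text) : $next;

    // The "word" that the cut point falls into.
    $word = substr($text, $start, $end - $start);

    // Assumption: PHP's own HTML 4.01 entity table serves as the known list.
    $known = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    foreach ($known as $entity) {
        if (strpos($word, $entity) !== false) {
            return substr($text, 0, $end);
        }
    }
    return substr($text, 0, $x);
}

$s = 'This is my string with a special char : &egrave; - and I want it to cut';
echo cut_at_space_if_entity($s, 45);
// prints: This is my string with a special char : &egrave;
```

Note that this extends the cut whenever the surrounding word contains an entity anywhere, even if the cut point itself is not inside it, so the result can be longer than strictly necessary.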
