php - UTF8 workflow PHP, MySQL summarized

Question

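I am working for international clients who use many different alphabets, so I am trying to finally get an overview of a complete workflow between PHP and MySQL that ensures all character encodings are inserted correctly. I have read a lot of tutorials on this but still have questions (there is much to learn), so I thought I would put everything together here and ask.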

PHP

header('Content-Type:text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');

HTML

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>

(The latter is optional and only a suggestion, but I believe I would rather suggest than do nothing.)

MySQL

CREATE DATABASE database_name DEFAULT CHARACTER SET utf8; or ALTER DATABASE database_name DEFAULT CHARACTER SET utf8; and/or use utf8_general_ci as the MySQL connection collation.

(It is important to note here that this will increase the database size if varchar is used.)

Connection

mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");

Business logic

Detect whether input is not UTF-8 with mb_detect_encoding() and convert it with iconv().
Validate overlong UTF-8 and UTF-16 sequences:

$body=preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7F])[\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/','�',$body);
$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);

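Questions

Is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher, and if so, does this mean I have to use the multibyte functions instead of the core functions, for example mb_substr() instead of substr()?
Is it still necessary to check for malformed input strings, and if so, what is a reliable function/class to do so? I possibly do not want to strip bad data, and I don't know enough about transliteration.
Should it really be utf8_general_ci, or rather utf8_bin?
Is there something missing in the above workflow?

Sources: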

http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/  
http://webcollab.sourceforge.net/unicode.html  
http://stackoverflow.com/a/3742879/1043231  
http://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/  
http://akrabat.com/php/utf8-php-and-mysql/

score 6 · Accepted Answer

mb_internal_encoding('UTF-8') doesn't do anything by itself, it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
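To illustrate the point above, here is a minimal sketch comparing an explicit $encoding argument with the default set by mb_internal_encoding(). The sample string and variable names are assumptions chosen for illustration only.

```php
<?php
// The \u{...} escape (PHP 7+) guarantees UTF-8 bytes regardless of how
// this file itself is saved.
$text = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

// Passing the encoding explicitly on each call:
$explicit = mb_strlen($text, 'UTF-8');

// Setting the default once, then omitting the parameter:
mb_internal_encoding('UTF-8');
$implicit = mb_strlen($text);

// A non-mb_ function is not affected by mb_internal_encoding():
$bytes = strlen($text);

var_dump($explicit, $implicit, $bytes); // int(5), int(5), int(6)
```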
IMO mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate meta data like headers or meta tags where the encoding is specified.
Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
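A rough sketch of that approach might look like the following. The helper names requireValidUtf8() and toUtf8() are hypothetical, and the sketch assumes the source encoding is already known (for example from a header or a specification) rather than guessed.

```php
<?php
/**
 * Hypothetical helper: accept a string only if it is valid UTF-8,
 * otherwise throw instead of trying to repair it.
 */
function requireValidUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $input;
}

/**
 * Hypothetical helper: convert text whose encoding is known from
 * metadata into UTF-8. Nothing is detected or guessed here.
 */
function toUtf8(string $input, string $declaredEncoding): string
{
    return mb_convert_encoding($input, 'UTF-8', $declaredEncoding);
}

// Valid UTF-8 passes through unchanged.
$valid = requireValidUtf8("caf\u{00E9}");

// The byte 0xE9 is "é" in ISO-8859-1; the declared encoding drives the conversion.
$converted = toUtf8("caf\xE9", 'ISO-8859-1');

// The same bytes are rejected when they are claimed to be UTF-8.
try {
    requireValidUtf8("caf\xE9");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Throwing keeps the decision about bad data in one place; whether to reject the request or ask the user to resubmit is left to the caller.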
Regarding:

does this mean I have to use all multi byte functions instead of its core functions

If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions only work on a byte level, not a character level, which is what you typically want when working with strings.
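The byte-versus-character difference can be seen in a short sketch like this one; the sample word is an assumption picked for illustration.

```php
<?php
$word = "h\u{00E9}llo"; // "héllo": the "é" occupies two bytes in UTF-8

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

$byteCut = substr($word, 0, 2);             // cuts through the middle of "é"
$charCut = mb_substr($word, 0, 2, 'UTF-8'); // keeps "é" intact

var_dump($byteLength);                          // int(6)
var_dump($charLength);                          // int(5)
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): truncated sequence
var_dump($charCut === "h\u{00E9}");             // bool(true)
```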
utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
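One way to see the difference is to ask the server directly. The sketch below is only an outline under stated assumptions: runCollationDemo() and COLLATION_DEMO_SQL are hypothetical names, $pdo is assumed to be an already opened PDO connection, and the MySQL server is assumed to accept the _utf8 introducer and the utf8_* collation names.

```php
<?php
// Compare "e" with "E" and with "é" under both collations.
// The _utf8 introducer marks each literal as utf8 (assumption: the
// target server supports this character set and these collations).
const COLLATION_DEMO_SQL = "SELECT
    _utf8'e' = _utf8'E' COLLATE utf8_general_ci AS general_ci_case,
    _utf8'e' = _utf8'\u{00E9}' COLLATE utf8_general_ci AS general_ci_accent,
    _utf8'e' = _utf8'E' COLLATE utf8_bin AS bin_case,
    _utf8'e' = _utf8'\u{00E9}' COLLATE utf8_bin AS bin_accent";

/**
 * Hypothetical helper: run the comparison on an existing connection and
 * return one row with the four comparison results.
 */
function runCollationDemo(PDO $pdo): array
{
    return $pdo->query(COLLATION_DEMO_SQL)->fetch(PDO::FETCH_ASSOC);
}
```

Going by the description above, the two general_ci columns should come back as 1 (equal) and the two bin columns as 0 (not equal).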

score 1 · Accepted Answer

Should it really be utf8_general_ci or rather utf8_bin?

You have to use utf8_bin for case-sensitive searches; otherwise use utf8_general_ci.

is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?

Yes, of course. If you have a multibyte string, you need the mb_* family of functions to work with it, except where a binary-safe standard PHP function such as str_replace() (and a few others) does the job.
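As a small sketch of why str_replace() is fine here: it matches raw bytes, and in valid UTF-8 the byte sequence of a complete character cannot begin in the middle of another character, so matches fall on character boundaries. The sample text is an assumption for illustration.

```php
<?php
$text = "caf\u{00E9} cr\u{00E8}me"; // "café crème"

// Replace "é" with "e" using the byte-oriented core function.
$replaced = str_replace("\u{00E9}", 'e', $text);

// "è" is left untouched and the result is still valid UTF-8.
var_dump($replaced === "cafe cr\u{00E8}me");     // bool(true)
var_dump(mb_check_encoding($replaced, 'UTF-8')); // bool(true)
```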

is it still necessary to check for malformed input strings and if so what is a reliable function/class to do so? I possibly do not want to strip bad data and don't know enough about transliteration.

Hmm, no you can't check it.
