php - Detecting non-English characters in a string

Question

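To fight some spam, I'm looking for a way to find out whether a string contains any Chinese/Cyrillic characters.

I checked the character ranges in UTF-8 at http://en.wikipedia.org/wiki/UTF-8, but I can't figure out how to work with those ranges in PHP.

What I'd really like to do is count the number of characters that fall in the Cyrillic range or the Chinese range. Can this be done with a regular expression?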

score 3 · Accepted Answer

You can check whether each character's code point falls within a specific Unicode range. Here is a list of Unicode ranges: http://jrgraphix.net/research/unicode_blocks.php
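
A minimal sketch of that range check, assuming PHP 7.2+ with the mbstring extension (for mb_ord). The function name count_in_ranges and the two constants are hypothetical, and the ranges cover only the Cyrillic, Cyrillic Supplement, and CJK Unified Ideographs blocks:

```php
<?php
// Hypothetical constants: [first, last] code points of the blocks to count.
const CYRILLIC_RANGES = [[0x0400, 0x04FF], [0x0500, 0x052F]];
const CJK_RANGES = [[0x4E00, 0x9FFF]];

// Hypothetical helper: counts the characters of $text whose code point
// lies inside any of the given ranges.
function count_in_ranges(string $text, array $ranges): int
{
    // Split into UTF-8 characters; the /u flag makes PCRE treat $text as UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0; // $text is not valid UTF-8
    }

    $count = 0;
    foreach ($chars as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        foreach ($ranges as [$first, $last]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                $count++;
                break; // count each character at most once
            }
        }
    }
    return $count;
}
```

For example, count_in_ranges('Привет world', CYRILLIC_RANGES) returns 6 and count_in_ranges('你好 hello', CJK_RANGES) returns 2.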

score 3 · Accepted Answer

Found a nice solution here: https://devdojo.com/blog/tutorials/php-detect-if-non-english

Use this code:

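function is_english($str)
{
    // utf8_decode() turns every multi-byte UTF-8 character into a single byte,
    // so the byte lengths differ as soon as $str contains a non-ASCII character.
    return strlen($str) === strlen(utf8_decode($str));
}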

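It works because utf8_decode replaces each multi-byte character with a single byte, so the string length changes whenever such a character is present. Note that this flags every non-ASCII character, including accented Latin letters such as é, not only Chinese or Cyrillic ones.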
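
The length comparison amounts to asking whether the string contains any byte outside the ASCII range. A minimal sketch of that same check without utf8_decode, which is deprecated as of PHP 8.2; the name is_ascii_only is hypothetical and not from the linked post:

```php
<?php
// Hypothetical helper: true when $str contains only ASCII bytes (0x00-0x7F).
function is_ascii_only(string $str): bool
{
    // No /u flag, so the pattern is applied byte by byte. Every byte of a
    // multi-byte UTF-8 sequence is >= 0x80 and therefore matches.
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}
```

With this sketch, is_ascii_only('hello') is true, while is_ascii_only('héllo') and is_ascii_only('Привет') are false.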

score 1 · Accepted Answer

In PHP, preg_match_all returns the number of full pattern matches.

Try

$n = preg_match_all('/\p{Cyrillic}/u', $text);

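or, since PCRE (the regex engine PHP uses) does not support Unicode block names such as \p{InCyrillic}, spell out the Cyrillic and Cyrillic Supplement blocks as a code point range:

$n = preg_match_all('/[\x{0400}-\x{052F}]/u', $text);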

For more information on using Unicode in regular expressions, read this article.
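
To cover both scripts from the question, the same approach can be sketched with the Han script property for Chinese characters. The function name count_script_chars and the array keys are hypothetical:

```php
<?php
// Hypothetical helper: counts Cyrillic and Han characters in a UTF-8 string.
function count_script_chars(string $text): array
{
    // preg_match_all() returns false when $text is not valid UTF-8;
    // the (int) cast turns that into a count of 0.
    return [
        'cyrillic' => (int) preg_match_all('/\p{Cyrillic}/u', $text),
        'han'      => (int) preg_match_all('/\p{Han}/u', $text),
    ];
}
```

For the input 'Привет, 世界!' this gives 6 Cyrillic and 2 Han characters. A spam filter could then compare these counts against the total length and apply a threshold of its own choosing.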

score 0 · Accepted Answer

You can easily check whether a string is valid UTF-8 with:

mb_check_encoding($inputString, "UTF-8");

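Note that this function appears to be buggy in PHP 5.2.0 through 5.2.6.

You may also find what you're looking for on the mb_check_encoding documentation page, especially in the comments. Here is the answer by javalc6 at gmail dot com, adapted to your case: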
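
A minimal sketch of how this validity check might be combined with a character count, assuming the mbstring extension is loaded. The function name count_cyrillic_if_valid is hypothetical; it returns null for input that is not valid UTF-8 so that case can be told apart from a count of zero:

```php
<?php
// Hypothetical helper: validates the encoding first, then counts Cyrillic characters.
function count_cyrillic_if_valid(string $input): ?int
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        return null; // not valid UTF-8, so a /u pattern would fail on it anyway
    }
    $n = preg_match_all('/\p{Cyrillic}/u', $input);
    return $n === false ? null : $n;
}
```

Here count_cyrillic_if_valid('Да') returns 2, and count_cyrillic_if_valid("\xFF") returns null.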

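function check_utf8($str) {
    $count = 0; // Number of bytes that are not valid UTF-8
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c > 127) { // Non-ASCII byte: must be the lead byte of a sequence
            $bytes = 0;
            if ($c > 247) { // 0xF8-0xFF never appear in UTF-8
                ++$count;
                continue;
            } else if ($c > 239)
                $bytes = 4; // Lead byte 0xF0-0xF7
            else if ($c > 223)
                $bytes = 3; // Lead byte 0xE0-0xEF
            else if ($c > 191)
                $bytes = 2; // Lead byte 0xC0-0xDF
            else { // 0x80-0xBF: continuation byte without a lead byte
                ++$count;
                continue;
            }
            if (($i + $bytes) > $len) { // Sequence is cut off by the end of the string
                ++$count;
                continue;
            }
            while ($bytes > 1) { // Each following byte must be in 0x80-0xBF
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    ++$count;
                $bytes--;
            }
        }
    }
    return $count;
}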

Although, to be honest, I haven't tested it.
