I had fun with this. I hope it helps.
Because JavaScript does not allow direct byte access on strings, the only way to find the start position is a forward scan.
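Update #3: I don't think changing the character codes really works. I was reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes it occupies depends on the encoding! So this is not the right approach.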
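To make the "same code point, different byte count" point concrete, here is a small sketch. The helper names `utf8Bytes` and `utf16Bytes` are made up for illustration, and it assumes a runtime that provides the standard `TextEncoder` (modern browsers and current Node.js versions do).

```javascript
// Hypothetical helper: number of bytes needed to store s as UTF-8.
function utf8Bytes(s) {
    return new TextEncoder().encode(s).length;
}

// Hypothetical helper: number of bytes needed to store s as UTF-16
// (every code unit of a JavaScript string is 2 bytes).
function utf16Bytes(s) {
    return s.length * 2;
}

// '\u00fc' is "ü" and '\u4f60' is "你"
['a', '\u00fc', '\u4f60'].forEach(function (c) {
    console.log(
        c,
        'code point:', c.codePointAt(0),
        'UTF-8 bytes:', utf8Bytes(c),
        'UTF-16 bytes:', utf16Bytes(c)
    );
});
```

The code point of each character is the same number no matter which encoding is used; only the byte counts in the last two columns differ.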
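That is not correct: there are actually no UTF-8 strings in JavaScript. According to the ECMAScript 262 specification, all strings, regardless of the input encoding, must be stored internally as UTF-16 ("[sequence of] 16-bit unsigned integers").
Considering this, the 8-bit shift is correct (but unnecessary).
The assumption that your character is stored as a 3-byte sequence is wrong...
In fact, every code unit in a JS (ECMA-262) string is 16 bits (2 bytes) long, so a character such as "你" occupies a single 2-byte unit, not three bytes.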
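You can check the 16-bit storage yourself. This is just a quick sketch; the variable names `s` and `emoji` are only for illustration.

```javascript
var s = 'abc\u4f60\u597d\u5417';  // "abc你好吗"

// .length counts 16-bit code units, not UTF-8 bytes
console.log(s.length);            // 6
console.log(s.charCodeAt(3));     // 20320 (0x4F60, fits in one code unit)

// The only exception to "one character = one unit" are characters
// outside the Basic Multilingual Plane: they take two 16-bit units
// (a surrogate pair).
var emoji = '\uD83D\uDE00';       // U+1F600
console.log(emoji.length);        // 2
console.log(emoji.codePointAt(0)); // 128512 (0x1F600)
```

In no case does a single string element hold a 3-byte UTF-8 sequence.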
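This can be worked around by manually converting the multibyte characters to UTF-8, as shown in the code below.
See the details explained in my example code: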
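// returns a string in which every character stands for one byte of the
// utf-8 encoding of s, so its .length is the utf-8 byte count of s
function encode_utf8(s) {
    return unescape(encodeURIComponent(s));
}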
function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

    /* this function scans a multibyte string and returns a substring.
     * arguments are start position and length, both defined in bytes.
     *
     * this is tricky, because javascript only allows character-level
     * and not byte-level access on strings. Also, all strings are stored
     * in utf-16 internally - so we need to convert characters to utf-8
     * to detect their length in utf-8 encoding.
     *
     * the startInBytes and lengthInBytes parameters are based on byte
     * positions in a utf-8 encoded string.
     * in utf-8, for example:
     *       "a" is 1 byte,
     *       "ü" is 2 bytes,
     *   and "你" is 3 bytes.
     *
     * NOTE:
     * according to ECMAScript 262 all strings are stored as a sequence
     * of 16-bit characters. so we need an encode_utf8() function to safely
     * detect the length our character would have in a utf-8 representation.
     *
     * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
     * see "4.3.16 String Value":
     * > Although each value usually represents a single 16-bit unit of
     * > UTF-16 text, the language does not place any restrictions or
     * > requirements on the values except that they be 16-bit unsigned
     * > integers.
     */
    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they don't leak as globals

    // scan the string forward to find the index of the first character
    // (convert the start position in bytes to a start position in characters);
    // stop at the end of the string if startInBytes lies beyond it.
    // limitation: characters stored as surrogate pairs are not handled,
    // because encodeURIComponent() throws a URIError on a lone surrogate
    for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {
        // get the numeric code of the character (>= 128 for a multibyte character)
        // and increase "bytePos" by the number of bytes the character occupies
        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    // a character that is only partly covered by the byte range is still
    // appended as a whole; the loop also stops at the end of the string
    for (n = startInChars; startInChars <= end && n < str.length; n++) {
        // get the numeric code of the character (>= 128 for a multibyte character)
        // and decrease "end" by the number of bytes the character occupies
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;
        resultStr += str[n];
    }

    return resultStr;
}
var orig = 'abc你好吗?';
alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
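For comparison, here is a sketch of a different technique, not the forward scan described above: encode the whole string to UTF-8 bytes, slice the bytes, and decode the slice again. The function name `substrUtf8BytesViaEncoder` is made up for this example, and it assumes the standard `TextEncoder` and `TextDecoder` are available in your runtime.

```javascript
// Hypothetical alternative: slice the UTF-8 bytes directly.
function substrUtf8BytesViaEncoder(str, startInBytes, lengthInBytes) {
    // Uint8Array holding the UTF-8 encoding of the whole string
    var bytes = new TextEncoder().encode(str);
    // view on the requested byte range (clamped to the array bounds)
    var slice = bytes.subarray(startInBytes, startInBytes + lengthInBytes);
    // by default the decoder does not throw on a cut inside a multibyte
    // sequence; the incomplete bytes become U+FFFD replacement characters
    return new TextDecoder('utf-8').decode(slice);
}

var sample = 'abc\u4f60\u597d\u5417'; // "abc你好吗"
console.log(substrUtf8BytesViaEncoder(sample, 0, 2)); // "ab"
console.log(substrUtf8BytesViaEncoder(sample, 3, 3)); // "你"
console.log(substrUtf8BytesViaEncoder(sample, 6, 6)); // "好吗"
```

The trade-off is that this version encodes the entire string even for a short substring, and a range that cuts through a character yields replacement characters rather than the whole character. On the other hand it copes with surrogate pairs, since the encoder works on the string as a whole.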