javascript - Substr based on bytes rather than character count

Question

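I'm building an input system in which a field can hold at most 200 bytes. I'm counting the remaining bytes with the following approach (which may itself be debatable!):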

var totalBytes = 200;
var $newVal = $(this).val();
var m = encodeURIComponent($newVal).match(/%[89ABab]/g);
var bytesLeft = totalBytes - ($newVal.length + (m ? m.length : 0));

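This seems to work well, but if someone pastes a large amount of data, I'd like to be able to slice the input and show only the first 200 bytes. In pseudocode I imagine it would look like: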

$newText = substrBytes($string, 0, 200);

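Any help or guidance would be much appreciated.

EDIT: Everything here is UTF-8, btw :)

EDIT 2: I know I could loop over every character and evaluate it; I was hoping there might be something more elegant to handle this.

Thanks!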

score 3 · Accepted Answer

A Google search yielded a blog article, complete with a try-it-yourself input box. I'm copying the code here because SO likes definitive answers rather than links, but credit goes to McDowell.

/**
 * codePoint - an integer containing a Unicode code point
 * return - the number of bytes required to store the code point in UTF-8
 */
function utf8Len(codePoint) {
  if(codePoint >= 0xD800 && codePoint <= 0xDFFF)
    throw new Error("Illegal argument: "+codePoint);
  if(codePoint < 0) throw new Error("Illegal argument: "+codePoint);
  if(codePoint <= 0x7F) return 1;
  if(codePoint <= 0x7FF) return 2;
  if(codePoint <= 0xFFFF) return 3;
  // RFC 3629 caps UTF-8 at 4 bytes and code points at U+10FFFF
  if(codePoint <= 0x10FFFF) return 4;
  throw new Error("Illegal argument: "+codePoint);
}

function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

/**
 * Transforms UTF-16 surrogate pairs to a code point.
 * See RFC2781
 */
function toCodepoint(highCodeUnit, lowCodeUnit) {
  if(!isHighSurrogate(highCodeUnit)) throw new Error("Illegal argument: "+highCodeUnit);
  if(!isLowSurrogate(lowCodeUnit)) throw new Error("Illegal argument: "+lowCodeUnit);
  highCodeUnit = (0x3FF & highCodeUnit) << 10;
  var u = highCodeUnit | (0x3FF & lowCodeUnit);
  return u + 0x10000;
}

/**
 * Counts the length in bytes of a string when encoded as UTF-8.
 * str - a string
 * return - the length as an integer
 */
function utf8ByteCount(str) {
  var count = 0;
  for(var i=0; i<str.length; i++) {
    var ch = str.charCodeAt(i);
    if(isHighSurrogate(ch)) {
      var high = ch;
      var low = str.charCodeAt(++i);
      count += utf8Len(toCodepoint(high, low));
    } else {
      count += utf8Len(ch);
    }
  }
  return count;
}
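
The code above counts bytes but does not cut the string. A minimal sketch of the truncation step, assuming you want to keep only whole code points; the name `truncateUtf8` is made up for illustration and is not from the blog article. An unpaired surrogate is counted as 3 bytes here, which is also an assumption.

```javascript
// Hypothetical helper: keep whole code points until the UTF-8 byte budget is spent.
function truncateUtf8(str, maxBytes) {
  var bytes = 0; // UTF-8 bytes consumed so far
  var end = 0;   // cut position, in UTF-16 code units
  while (end < str.length) {
    var cp = str.codePointAt(end);
    // UTF-8 size of this code point (unpaired surrogates fall in the 3-byte range)
    var size = cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
    if (bytes + size > maxBytes) break; // the next code point would not fit
    bytes += size;
    end += cp > 0xFFFF ? 2 : 1;         // astral code points span two code units
  }
  return str.slice(0, end);
}

// Usage: "h\u00E9llo" is "héllo"; the é needs 2 bytes, so only "h" fits in 2 bytes.
console.log(truncateUtf8("h\u00E9llo", 2));
```

Because the cut always falls on a code point boundary, the result can be a byte or more shorter than the limit.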

score 1 · Accepted Answer

Strings in JavaScript are represented internally as UTF-16, so each code unit takes two bytes (and characters outside the BMP take two code units). So your question is really about the byte length of the string when encoded as UTF-8.

You hardly want half a character, so the result may come out slightly under the limit, for example 198 or 199 bytes.
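
To make the difference concrete, here is a small sketch comparing `length` (UTF-16 code units) with the UTF-8 byte count. It assumes `TextEncoder` is available (modern browsers, recent Node.js); `utf8Size` is a hypothetical helper name.

```javascript
// Hypothetical helper: UTF-8 byte length via TextEncoder (assumed to be available).
function utf8Size(s) {
  return new TextEncoder().encode(s).length;
}

// a, e-acute, euro sign, and an emoji outside the BMP
var samples = ["a", "\u00E9", "\u20AC", "\uD83D\uDE00"];
samples.forEach(function (s) {
  // s.length counts UTF-16 code units; utf8Size counts UTF-8 bytes
  console.log(s, s.length, utf8Size(s));
});
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, while `length` reports 1, 1, 1 and 2. If a 3-byte character starts at byte 199, it has to be dropped whole, leaving 198 bytes.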

Here are two different solutions:

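// direct byte size counting
function cutInUTF8(str, n) {
    var len = Math.min(n, str.length);
    var i, cs, c = 0, bytes = 0;
    for (i = 0; i < len; i++) {
        c = str.charCodeAt(i);
        cs = 1; // bytes this character needs in UTF-8
        if (c >= 128) cs++;
        if (c >= 2048) cs++;
        if (c >= 0xD800 && c < 0xDC00) {
            // high surrogate: a valid pair encodes to 4 bytes
            c = str.charCodeAt(i + 1);
            if (c >= 0xDC00 && c < 0xE000) cs++;
            // otherwise it is unpaired; you might actually want to throw an error
        }
        // stop before this character if it would exceed the byte budget,
        // so that i never points into the middle of a surrogate pair
        if (n < bytes + cs) break;
        bytes += cs;
        if (cs === 4) i++; // also skip the low surrogate of the pair
    }
    return str.substr(0, i);
}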

// using internal functions, but is not very fast due to try/catch
function cutInUTF8(str, n) {
    var encoded = unescape(encodeURIComponent(str)).substr(0, n);
    while (true) {
        try {
            str = decodeURIComponent(escape(encoded));
            return str;
        } catch(e) {
            encoded = encoded.substr(0, encoded.length-1);
        }
    }
}
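
If `TextEncoder` and `TextDecoder` are available (modern browsers, recent Node.js), a third variant avoids both the manual size table and the try/catch loop. This is only a sketch under that assumption; `cutWithTextEncoder` is a hypothetical name. It encodes the string, then backs up from the cut position until it no longer lands on a continuation byte.

```javascript
// Hypothetical helper: truncate to at most maxBytes of UTF-8 without splitting a character.
function cutWithTextEncoder(str, maxBytes) {
  var bytes = new TextEncoder().encode(str);
  if (bytes.length <= maxBytes) return str; // already fits
  var end = maxBytes;
  // UTF-8 continuation bytes look like 10xxxxxx; if the first dropped byte
  // is one, the cut is inside a multi-byte sequence, so move it back.
  while (end > 0 && (bytes[end] & 0xC0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// Usage: "a" plus an emoji is 5 bytes; with a 3-byte budget only "a" survives.
console.log(cutWithTextEncoder("a\uD83D\uDE00", 3));
```

Note that `TextEncoder` replaces unpaired surrogates with U+FFFD, so malformed input is altered rather than rejected.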
