21

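I have a list of character range restrictions that I need to check a string against, but the `char` type in .NET is UTF-16, so some characters become wacky (surrogate) pairs. As a result, when I enumerate all the `char`s in a `string`, I don't get 32-bit Unicode code points, and some comparisons against high values fail.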

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a `string` to an array (`int[]`) of 32-bit Unicode code points?


5 Answers

23
Answered 2015-01-26T17:12:01.667
7

This answer is not correct. See @Virtlink's answer for the correct one.

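// Needs: using System.Collections.Generic; using System.Globalization;
static int[] ExtractScalars(string s)
{
  // Normalizing (form C by default) can change which code points the string contains.
  if (!s.IsNormalized())
  {
    s = s.Normalize();
  }

  List<int> chars = new List<int>((s.Length * 3) / 2);

  // Enumerates text elements (grapheme clusters), not individual code points.
  var ee = StringInfo.GetTextElementEnumerator(s);

  while (ee.MoveNext())
  {
    string e = ee.GetTextElement();
    // Only the first code point of each text element is kept;
    // any further code points in the same element are dropped.
    chars.Add(char.ConvertToUtf32(e, 0));
  }

  return chars.ToArray();
}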

Note: Normalization is required to deal with composite characters.
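For anyone wondering why the code above falls short: it returns one value per text element, so a base character followed by combining marks that have no precomposed form loses everything after the first code point, and normalizing first changes the code points being reported. A minimal sketch of a loop that instead walks the UTF-16 code units and joins surrogate pairs is below. The class and method names (`CodePointSketch`, `ToCodePoints`) are made up for illustration, and this is not necessarily what the referenced answer does.

```csharp
using System;
using System.Collections.Generic;

static class CodePointSketch
{
    // Hypothetical helper: returns every code point in s, in order,
    // without normalizing and without grouping into text elements.
    public static int[] ToCodePoints(string s)
    {
        var result = new List<int>(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            // Reads the code point that starts at index i. Throws
            // ArgumentException if an unpaired surrogate is found there.
            result.Add(char.ConvertToUtf32(s, i));

            if (char.IsHighSurrogate(s[i]))
            {
                i++; // the low surrogate was consumed as part of the pair
            }
        }
        return result.ToArray();
    }
}
```

With this, `CodePointSketch.ToCodePoints("a\uD83D\uDE00b")` yields three values: 0x61, 0x1F600 and 0x62. Whether throwing on a lone surrogate is the right behavior depends on where the strings come from, so treat that as a design choice rather than a requirement.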

Answered 2009-03-26T20:28:10.787
4

Doesn't seem like it should be much more complicated than this:

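// Needs: using System; using System.Collections.Generic; using System.Linq; using System.Text;
// Extension methods must be declared inside a static class.
public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
  // Encode in the machine's own byte order so BitConverter.ToInt32 reads each value correctly.
  bool      useBigEndian = !BitConverter.IsLittleEndian;
  // No BOM; throw on invalid input such as an unpaired surrogate.
  Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
  // Encoding.GetBytes has no IEnumerable<char> overload, so copy to a char[] first.
  byte[]    octets       = utf32.GetBytes( s.ToArray() ) ;

  // Each code point occupies exactly four bytes in UTF-32.
  for ( int i = 0 ; i < octets.Length ; i+=4 )
  {
    int codePoint = BitConverter.ToInt32(octets,i);
    yield return codePoint;
  }

}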
Answered 2015-01-26T18:11:49.250
0

I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:

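    // Needs: using System; using System.Collections.Generic; using System.Linq; using System.Text;
    public static IEnumerable<int> GetCodePoints(this string s) {
        // Native byte order, no BOM, throw on invalid input (e.g. an unpaired surrogate).
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        // Four bytes per code point; read each group back as an int.
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }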

The enumeration was all I needed, but getting an array is trivial:

int[] codePoints = myString.GetCodePoints().ToArray();
Answered 2016-07-19T14:10:27.377
0

This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:

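// Needs: using System; using System.Text;
public static int[] ToCodePoints(string s)
{
    // Encoding.UTF32 is little-endian, and Buffer.BlockCopy copies raw bytes,
    // so the result is only correct on a little-endian machine.
    byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
    int[] codepoints = new int[utf32bytes.Length / 4];
    Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
    return codepoints;
}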
Answered 2020-06-12T06:44:08.073