c# - Reading UTF8/UNICODE characters from an escaped ASCII sequence

Question

I have the following name in a file, and I need to read it as a UTF-8 encoded string. So from this:

test_\303\246\303\270\303\245.txt

I need to get the following:

test_æøå.txt

Any idea how to achieve this in C#?

score 4 · Accepted Answer

Assuming you have this string:

string input = "test_\\303\\246\\303\\270\\303\\245.txt";

i.e. literally:

test_\303\246\303\270\303\245.txt

You can do this:

using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "test_\\303\\246\\303\\270\\303\\245.txt";
Encoding iso88591 = Encoding.GetEncoding(28591); // See the note at the end of the answer
Encoding utf8 = Encoding.UTF8;

// Turn the octal escape sequences into characters with code points 0-255;
// this results in a "binary string"
string binaryString = Regex.Replace(input, @"\\(?<num>[0-7]{3})", delegate(Match m)
{
    string oct = m.Groups["num"].Value;
    return Char.ConvertFromUtf32(Convert.ToInt32(oct, 8));
});

// Turn the "binary string" into bytes
byte[] raw = iso88591.GetBytes(binaryString);

// Decode the bytes as UTF-8 into a C# string
string output = utf8.GetString(raw);
Console.WriteLine(output);
// test_æøå.txt

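By "binary string" I mean a string consisting only of characters with code points 0-255. It therefore amounts to a poor man's byte[], where you retrieve the code point of the character at index i instead of the value of the byte at index i in a byte[] (this is what we did in JavaScript a few years ago). Because ISO-8859-1 maps exactly the first 256 Unicode code points to a single byte each, it is perfect for converting a "binary string" into a byte[].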
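
As a possible alternative, the intermediate "binary string" can be skipped by collecting the bytes directly and decoding them as UTF-8 at the end. The following is only a minimal sketch, not part of the original answer: the class name OctalEscapes and the method Decode are hypothetical, and it assumes that every escape is a backslash followed by exactly three octal digits with a value in the range 000-377. Anything else is copied through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; the names are not from the original answer.
static class OctalEscapes
{
    public static string Decode(string input)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // Assumption: an escape is '\' plus three octal digits, value <= 377 (one byte).
            if (input[i] == '\\' && i + 3 < input.Length && IsOctalByte(input, i + 1))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 3), 8));
                i += 4;
            }
            else
            {
                // Copy any other character through as its UTF-8 bytes,
                // keeping surrogate pairs together.
                int len = char.IsSurrogatePair(input, i) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        // Decode the collected bytes as UTF-8.
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    // True if the three characters starting at 'start' form an octal number 000-377.
    private static bool IsOctalByte(string s, int start)
    {
        return s[start] >= '0' && s[start] <= '3'
            && s[start + 1] >= '0' && s[start + 1] <= '7'
            && s[start + 2] >= '0' && s[start + 2] <= '7';
    }
}
```

With the input from the question, OctalEscapes.Decode(@"test_\303\246\303\270\303\245.txt") yields test_æøå.txt: the escapes \303\246 are the bytes 0xC3 0xA6, which is the UTF-8 encoding of æ. Unlike the regex version, this sketch leaves out-of-range sequences such as \400 as literal text instead of turning them into a character above 255.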
