c# - Parsing a UTF8 JSON response from the server

Question

I've run into a strange problem parsing a JSON response from a server. Fetching the response this way (with Content-Type: text/html) has worked fine for the past few months:

string response = "";
using (var client = new System.Net.Http.HttpClient())
{
    var postData = new System.Net.Http.FormUrlEncodedContent(data);
    var clientResult = await client.PostAsync(url, postData);
    if(clientResult.IsSuccessStatusCode)
    {
        response = await clientResult.Content.ReadAsStringAsync();
    }
}
//Parse the response to a JObject...

But when the response arrives with Content-Type: text/html; charset=utf8, it throws an exception saying the Content-Type is invalid.

Exception message: The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

So I changed this:

response = await clientResult.Content.ReadAsStringAsync();

to this:

var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
response = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);

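Now I get the response without an exception, but parsing it throws a parse exception. While debugging I got this (I changed the response to a shorter one for testing):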

var r1 = await clientResult.Content.ReadAsStringAsync();
var r2 = Encoding.UTF8.GetString(await clientResult.Content.ReadAsByteArrayAsync(), 0, raw_response.Length);
System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r1.Length, r1);
System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r2.Length, r2);

//Output
Length: 38 - {"version":1,"specialword":"C\u00e3o"}
Length: 39 - {"version":1,"specialword":"C\u00e3o"}

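The JSON looks correct in both cases, but the lengths differ and I don't know why. When I copied it into Notepad++ to look for hidden characters, a ? appeared out of nowhere.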

Length: 38 - {"version":1,"specialword":"C\u00e3o"}
Length: 39 - ?{"version":1,"specialword":"C\u00e3o"}

This ? is obviously what triggers the parse exception, but I don't know why Encoding.UTF8.GetString produces it.

I've been fighting with this for the past few hours and I really need some help.

score 8 · Accepted Answer

Well, I'm surprised that you're getting that behavior; I would have expected Encoding.UTF8.GetString to handle that for you.

What you're seeing, the character value 0xFEFF, is a byte order mark ("BOM"). A BOM is unnecessary in UTF-8 because the byte order is not variable, but it is allowed, as a marker that the following text is encoded as UTF-8. (The actual byte sequence is EF BB BF, but when that is decoded as UTF-8, it becomes code point FEFF.)
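To see the effect in isolation, here is a minimal sketch that prepends the bytes EF BB BF to a payload and decodes it with Encoding.UTF8.GetString. The class and method names (BomDemo, WithBom, Decode) are made up for illustration and are not part of the question's code.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BomDemo
{
    // The three bytes that encode U+FEFF in UTF-8.
    public static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Builds a payload consisting of the UTF-8 BOM followed by the text.
    public static byte[] WithBom(string text)
    {
        byte[] body = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[Utf8Bom.Length + body.Length];
        Array.Copy(Utf8Bom, 0, result, 0, Utf8Bom.Length);
        Array.Copy(body, 0, result, Utf8Bom.Length, body.Length);
        return result;
    }

    // Plain decode: GetString keeps the BOM as a leading U+FEFF character.
    public static string Decode(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
    }
}
```

Decoding BomDemo.WithBom(json) should give a string one character longer than json, with '\uFEFF' at index 0, which matches the 38 versus 39 lengths in the question.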

~~If you create your own UTF8Encoding instance, you can tell it whether to include or exclude the BOM.~~ (I think I'm mistaken about that; it may only control whether it includes one when encoding.)
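A small sketch of that distinction, as far as I understand it: the UTF8Encoding constructor flag changes what GetPreamble returns, while GetString leaves a leading BOM in the decoded string either way. PreambleDemo and its methods are hypothetical names used only for this example.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class PreambleDemo
{
    // Number of BOM bytes the encoding reports as its preamble.
    public static int PreambleLength(bool emitBom)
    {
        return new UTF8Encoding(emitBom).GetPreamble().Length;
    }

    // Decodes with an encoding constructed with the BOM flag turned off.
    // The flag does not make GetString strip a BOM from the input.
    public static string DecodeWithoutBomFlag(byte[] bytes)
    {
        var encoding = new UTF8Encoding(false);
        return encoding.GetString(bytes, 0, bytes.Length);
    }
}
```

So constructing the encoding with the flag set to false does not help here; the BOM still has to be handled when decoding.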

Alternately, you could explicitly test for that and remove the BOM if present, e.g.:

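// Read the body once and keep the bytes, so the length matches the buffer being decoded.
var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
var r2 = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);
// Drop a leading BOM (U+FEFF) if present; the length check guards against an empty body.
if (r2.Length > 0 && r2[0] == '\uFEFF') {
    r2 = r2.Substring(1);
}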
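As an alternative to checking the first character by hand, the bytes can be decoded through a StreamReader, which skips a leading UTF-8 BOM when it detects one. This is only a sketch; ResponseDecoder and DecodeUtf8 are hypothetical names, and it assumes the body really is UTF-8.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class ResponseDecoder
{
    // Decodes UTF-8 bytes; StreamReader consumes a leading BOM if one is present.
    public static string DecodeUtf8(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this helper, the bytes from ReadAsByteArrayAsync can be passed straight to ResponseDecoder.DecodeUtf8 and the result handed to the JSON parser, whether or not the server sent a BOM.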
