c# - Reading text from an ANSI-encoded file

Question

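I use the Q42.Winrt library to download an HTML file into the cache. But when I call ReadTextAsync I get an exception:

No mapping for the Unicode character exists in the target multi-byte code page. (Exception from HRESULT: 0x80070459)

My code is very simple: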

var parsedPage = await WebDataCache.GetAsync(new Uri(String.Format("http://someUrl.here")));
var parsedStream = await FileIO.ReadTextAsync(parsedPage);

I opened the downloaded file and it is ANSI-encoded. I think I need to convert it to UTF-8, but I don't know how.

score 6 · Accepted Answer

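The problem is that the original page is not encoded as Unicode. It is Windows-1251, and ReadTextAsync only handles Unicode or UTF-8. The fix is to read the file as binary and then use Encoding.GetEncoding to interpret the bytes with the 1251 code page, which produces a string (strings are always Unicode).

For example: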

        string parsedStream;
        var parsedPage = await WebDataCache.GetAsync(new Uri("http://bash.im"));

        // Read the cached file as raw bytes instead of text.
        var buffer = await FileIO.ReadBufferAsync(parsedPage);
        using (var dr = DataReader.FromBuffer(buffer))
        {
            var bytes1251 = new byte[buffer.Length];
            dr.ReadBytes(bytes1251);

            // Interpret the bytes with the Windows-1251 code page.
            parsedStream = Encoding.GetEncoding("Windows-1251").GetString(bytes1251, 0, bytes1251.Length);
        }

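The catch is that you cannot tell the code page from the stored bytes, so this works here but may not work for other sites. In general, UTF-8 is what you get from the web, but not always. The Content-Type response header for this page states the code page, but that information is not stored in the file.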
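
Since the code page is not stored with the bytes, one common heuristic is to try a strict UTF-8 decode first and only fall back to a legacy code page when the bytes are not valid UTF-8. The sketch below illustrates that idea; `TextDecoding` and `DecodeWithFallback` are hypothetical names, not part of Q42.Winrt or the WinRT API, and the fallback encoding is an assumption the caller has to supply. It is a heuristic, not real detection: a short Windows-1251 text could in principle also be valid UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any library mentioned above.
public static class TextDecoding
{
    // Decodes bytes as UTF-8 when they form valid UTF-8, otherwise decodes
    // them with the caller-supplied fallback encoding (an assumption, for
    // example Windows-1251 for the page discussed here).
    public static string DecodeWithFallback(byte[] bytes, Encoding fallback)
    {
        // With throwOnInvalidBytes set, invalid input raises
        // DecoderFallbackException instead of being replaced with U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes, 0, bytes.Length);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: reinterpret the same bytes with the fallback.
            return fallback.GetString(bytes, 0, bytes.Length);
        }
    }
}
```

With the answer's code, this could be called as `TextDecoding.DecodeWithFallback(bytes1251, Encoding.GetEncoding("Windows-1251"))`. On newer .NET runtimes, code pages such as Windows-1251 may first need `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.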
