c# - Reading a text file with multiple (mixed) encodings

Question

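I have a text file that contains multiple encodings, where the encoding to use is itself specified inside the text file (the vCard format is one example that allows this). Here is an example: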

charset=windows-1251: ABCDE
charset=utf-8: VWXYZ

...where "ABCDE" would be interpreted using the "windows-1251" encoding and "VWXYZ" using UTF-8. Ultimately, I want everything converted to a standard string (UTF-16 in C#).

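I think I'd like to use ReadAllText(), since it apparently helps by automatically using a default encoding when none is otherwise specified. When a charset is specified as above, it would override the default encoding.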

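Unfortunately, I also need to do some text parsing to find the various encodings, so I think I need ReadAllBytes(), so that I can parse character by character in a more raw format.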

I would also like it to be fast. What is the best way to handle this?

score 2 · Accepted Answer

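Assuming all the metadata about encodings is in ASCII, you can decode the whole file with some permissive single-byte encoding so that you can parse the text as usual. Then re-decode each string (from the bytes) using the appropriate encoding.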

Some silly sample code:

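// Requires: using System; using System.Text;

// A permissive single-byte encoding: every byte decodes to exactly one char.
var singleByte = Encoding.GetEncoding("Windows-1252");
string asString = System.IO.File.ReadAllText("C:/Temp/test.txt", singleByte);
byte[] asBytes = System.IO.File.ReadAllBytes("C:/Temp/test.txt");

// ParseFile is not defined here; it stands for your own parser that
// reports each entry's position, length, and declared encoding name.
foreach (var entry in ParseFile(asString))
{
    int start = entry.PositionInString;
    // Since we used a one-byte encoding, we can use this location
    // directly in the byte-array.

    int length = entry.Length;
    string encodingName = entry.Encoding;
    string decodedEntry = Encoding.GetEncoding(encodingName)
                                  .GetString(asBytes, start, length);
    Console.WriteLine(decodedEntry);
}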
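
For completeness, here is a minimal self-contained sketch of the same idea applied to the line format from the question ("charset=<name>: <payload>", one entry per line). The class MixedEncodingReader and its methods are hypothetical names made up for this illustration, not part of any library. It assumes .NET Core 3.0 or later (where CodePagesEncodingProvider ships in the box), that every payload encoding is ASCII-compatible so a 0x0A byte always means a line break, and that the first ": " on a line ends the charset header. It uses ISO-8859-1 rather than Windows-1252 for the header scan, because ISO-8859-1 is available without registering a provider and also maps each byte to exactly one char.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only.
public static class MixedEncodingReader
{
    // ISO-8859-1 maps every byte to exactly one char, so string indexes
    // in the decoded header equal byte offsets in the raw data.
    private static readonly Encoding SingleByte = Encoding.GetEncoding("ISO-8859-1");

    static MixedEncodingReader()
    {
        // Makes code pages such as windows-1251 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Splits the raw bytes on '\n' and decodes each non-empty line.
    // Lines without a "charset=<name>: " header are decoded with fallback.
    public static List<string> DecodeLines(byte[] data, Encoding fallback)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < data.Length)
        {
            int end = Array.IndexOf(data, (byte)'\n', pos);
            if (end < 0) end = data.Length;

            int length = end - pos;
            // Drop a trailing '\r' so CRLF files work too.
            if (length > 0 && data[pos + length - 1] == (byte)'\r') length--;

            if (length > 0) result.Add(DecodeLine(data, pos, length, fallback));
            pos = end + 1;
        }
        return result;
    }

    private static string DecodeLine(byte[] data, int start, int length, Encoding fallback)
    {
        const string prefix = "charset=";
        string raw = SingleByte.GetString(data, start, length);
        int sep = raw.IndexOf(": ", StringComparison.Ordinal);

        if (sep < 0 || !raw.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return fallback.GetString(data, start, length);

        string name = raw.Substring(prefix.Length, sep - prefix.Length);
        int payloadOffset = sep + 2; // skip the ": " separator

        // Encoding.GetEncoding throws ArgumentException for unknown names.
        return Encoding.GetEncoding(name)
                       .GetString(data, start + payloadOffset, length - payloadOffset);
    }
}
```

A call might look like MixedEncodingReader.DecodeLines(System.IO.File.ReadAllBytes(path), Encoding.UTF8). The file is read once and each payload is decoded once straight from the byte array, which should keep this reasonably fast; whether it is fast enough for a given workload would need measuring.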
