c# - Get and change a file's encoding in C#

Question

I'm a bit confused about file encodings. I want to change the encoding of a file. Here is my code:

public class ChangeFileEncoding
    {
        private const int BUFFER_SIZE = 15000;

        public static void ChangeEncoding(string source, Encoding destinationEncoding)
        {
            var currentEncoding = GetFileEncoding(source);
            string destination = Path.GetDirectoryName(source) +@"\"+ Guid.NewGuid().ToString() + Path.GetExtension(source);
            using (var reader = new StreamReader(source, currentEncoding))
            {
                using (var writer =new StreamWriter(File.OpenWrite(destination),destinationEncoding ))
                {
                    char[] buffer = new char[BUFFER_SIZE];
                    int charsRead;
                    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        writer.Write(buffer, 0, charsRead);                        
                    }
                }
            }
            File.Delete(source);
            File.Move(destination, source);
        }

        public static Encoding GetFileEncoding(string srcFile)
        {
            using (var reader = new StreamReader(srcFile))
            {
                reader.Peek();
                return reader.CurrentEncoding;
            }
        }
    }

In Program.cs I have this code:

    string file = @"D:\path\test.txt";
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);
    ChangeFileEncoding.ChangeEncoding(file, new System.Text.ASCIIEncoding());
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);

The text printed in my console is:

Unicode (UTF-8)

Unicode (UTF-8)

Why doesn't the file's encoding change? Am I changing the file's encoding the wrong way?

Regards

score 1 · Accepted Answer

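When no Encoding is passed to its constructor, the StreamReader class tries to detect the file's encoding automatically. It does this well when the file starts with a BOM (and you should write the preamble when changing a file's encoding, so that the file is easier to read next time).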
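
As a rough illustration of the preamble point, here is a minimal sketch. The class and method names (`BomDemo`, `DetectAfterWrite`) are hypothetical and not part of the question's code. It writes text through a `StreamWriter`, which emits the encoding's preamble at the start of a new file, and then asks a `StreamReader` what it detected:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: writes `text` to a temp file using `encoding`,
    // then returns the encoding a StreamReader reports for that file.
    public static Encoding DetectAfterWrite(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            // StreamWriter writes encoding.GetPreamble() at the start of a new file.
            // ASCII has an empty preamble, so nothing is written for it.
            using (var writer = new StreamWriter(path, false, encoding))
                writer.Write(text);

            // Third argument: look for a BOM; if none is found, keep the UTF-8 default.
            using (var reader = new StreamReader(path, Encoding.UTF8, true))
            {
                reader.Peek(); // detection happens on the first read
                return reader.CurrentEncoding;
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Under these assumptions, `BomDemo.DetectAfterWrite("hello", Encoding.Unicode)` should report UTF-16 because the BOM is present, while passing `Encoding.ASCII` leaves no BOM to find, so the reader keeps reporting UTF-8. That matches the output seen in the question.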

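Correctly detecting the encoding of a text file is a hard problem, especially for non-Unicode files or Unicode files without a BOM. The reader (whether StreamReader, Notepad++ or anything else) has to guess which encoding the file uses.

See also How can I detect the encoding/codepage of a text file, emphasis mine:

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results.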

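Because ASCII (characters 0-127) is a subset of Unicode, and UTF-8 encodes those characters as the same single bytes, it is safe to read an ASCII file as UTF-8. That is why the StreamReader reports that encoding.

That holds only as long as the file is real ASCII. Any byte above 127 belongs to some ANSI code page, and then you get into the fun of guessing the correct one.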
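
The subset relationship can be checked directly. The following is only a sketch, with a hypothetical helper name (`AsciiSubsetDemo.SameBytesInAsciiAndUtf8`), that compares the bytes produced by `Encoding.ASCII` and `Encoding.UTF8` for the same string:

```csharp
using System;
using System.Linq;
using System.Text;

public static class AsciiSubsetDemo
{
    // True when the text encodes to identical bytes in ASCII and UTF-8,
    // which holds when every character is in the range 0-127.
    public static bool SameBytesInAsciiAndUtf8(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text); // characters above 127 become '?'
        byte[] utf8 = Encoding.UTF8.GetBytes(text);   // characters above 127 become multi-byte sequences
        return ascii.SequenceEqual(utf8);
    }
}
```

For pure ASCII text the two byte arrays match, so a file written with `ASCIIEncoding` is byte-for-byte a valid UTF-8 file and no reader can tell the difference. Note that writing non-ASCII text with `ASCIIEncoding` replaces those characters with `?`, so the conversion loses data.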

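So, to answer your question: you have changed the file's encoding. There is simply no foolproof way to "detect" it; you can only guess.

Required reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode, UTF, ASCII, ANSI format differences.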

score 0 · Accepted Answer

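Detecting the encoding with StreamReader.CurrentEncoding is a bit tricky, because it does not tell you what encoding the file uses, only what encoding the StreamReader decided it needs to read it. Basically, when there is no BOM, there is no easy way to detect the encoding without reading the whole file (and analysing what you find there, which is not trivial).

For files with a BOM, it is easy: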

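public static Encoding GetFileEncoding(string srcFile)
{
   // A BOM is at most four bytes long, so the first four bytes are enough.
   var bom = new byte[4];
   int read;
   using (var f = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
     read = f.Read(bom, 0, 4);

   // Note: Encoding.UTF7 is marked obsolete in .NET 5 and later.
   if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
   if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
   // UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE).
   if (read == 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32;
   if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode;
   if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode;
   // 00 00 FE FF is the big-endian UTF-32 BOM; Encoding.UTF32 is little-endian.
   if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);
   // No BOM, so you choose what to return... the usual would be returning UTF8 or ASCII
   return Encoding.UTF8;
}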
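
For files without a BOM, one possible heuristic (not part of the answer above, and only a sketch) is to check whether the bytes decode as strict UTF-8. The helper name `Utf8Guess.LooksLikeUtf8` is hypothetical; it relies on the `UTF8Encoding` constructor whose second argument makes the decoder throw on invalid byte sequences:

```csharp
using System;
using System.Text;

public static class Utf8Guess
{
    // Hypothetical helper: returns true when the bytes decode as strict UTF-8.
    // A true result only means "consistent with UTF-8"; it is a guess, not proof.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

It could be called as `Utf8Guess.LooksLikeUtf8(File.ReadAllBytes(path))`. Pure ASCII files always pass, and text in an ANSI code page that contains bytes above 127 usually fails, but a pass still does not prove that the file was meant to be UTF-8.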
