c# - Handling non-English characters in C#

Question

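I need to get a proper understanding of character sets and encodings. Can someone point me to a good article on handling different character sets in C#?

Here is one of the problems I am facing: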

        using (StreamReader reader = new StreamReader("input.txt"))
        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            while (!reader.EndOfStream)
            {
                writer.WriteLine(reader.ReadLine());
            }
        }

This simple snippet does not always preserve the encoding.

For example:

Aukéna in the input becomes Aukï¿½na in the output.

score 5 · Accepted Answer

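You simply have an encoding problem. Keep in mind that what you are really reading is just a stream of bits; you have to tell your program how to interpret those bits correctly.

To fix your problem, use the constructors that also take an encoding, and set it to whatever encoding your text uses.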

http://msdn.microsoft.com/en-us/library/ms143456.aspx

http://msdn.microsoft.com/en-us/library/3aadshsx.aspx
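To make the "stream of bits" point concrete, here is a minimal sketch of how the garbling in the question can arise. It assumes the input file was saved in a single-byte encoding such as ISO-8859-1 (the question does not say which encoding was actually used) and that it was then read and written as UTF-8. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Assumption: the file was saved as ISO-8859-1, where 'é' is the single byte 0xE9.
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Decode the Latin-1 bytes of the text as if they were UTF-8.
    public static string ReadAsUtf8(string text)
    {
        byte[] fileBytes = Latin1.GetBytes(text);
        // 0xE9 followed by 'n' is not a valid UTF-8 sequence, so the decoder
        // substitutes U+FFFD (the replacement character).
        return Encoding.UTF8.GetString(fileBytes);
    }

    // Write the damaged text as UTF-8, then view those bytes as Latin-1.
    public static string ViewAsLatin1(string damaged)
    {
        byte[] written = Encoding.UTF8.GetBytes(damaged); // U+FFFD becomes EF BF BD
        return Latin1.GetString(written);
    }

    static void Main()
    {
        string original = "Auk\u00E9na"; // "Aukéna"
        string damaged = ReadAsUtf8(original);
        Console.WriteLine(damaged == "Auk\uFFFDna");                           // True
        Console.WriteLine(ViewAsLatin1(damaged) == "Auk\u00EF\u00BF\u00BDna"); // True: "Aukï¿½na"
    }
}
```

Once the replacement character has been substituted, the original byte is gone, so the damage cannot be undone from the output file alone; the fix has to happen at read time.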

score 2 · Accepted Answer

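When reading a file, you need to know which encoding it uses. Otherwise you can easily end up reading it incorrectly.

When you know the file's encoding, you can do the following: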

        using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
        using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
        {
            while (!reader.EndOfStream)
            {
                writer.WriteLine(reader.ReadLine());
            }
        }

A different problem arises if you want to change the file's original encoding.
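Changing the encoding comes down to decoding with the source encoding and re-encoding with the target one. A minimal sketch, assuming the input is ISO-8859-1 and the desired output is UTF-8; the `Transcode` helper and the file names are hypothetical:

```csharp
using System;
using System.IO;
using System.Text;

static class TranscodeDemo
{
    // Reads inputPath using sourceEncoding and writes the same text to
    // outputPath using targetEncoding. Hypothetical helper for illustration.
    public static void Transcode(string inputPath, Encoding sourceEncoding,
                                 string outputPath, Encoding targetEncoding)
    {
        using (var reader = new StreamReader(inputPath, sourceEncoding))
        using (var writer = new StreamWriter(outputPath, false, targetEncoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1"); // assumed source encoding
        File.WriteAllBytes("input.txt", latin1.GetBytes("Auk\u00E9na\n"));

        // UTF8Encoding(false) writes UTF-8 without a byte order mark.
        Transcode("input.txt", latin1, "output.txt", new UTF8Encoding(false));

        string roundTripped = File.ReadAllText("output.txt", Encoding.UTF8).Trim();
        Console.WriteLine(roundTripped == "Auk\u00E9na");
    }
}
```

Going the other way, from Unicode text to a single-byte code page, can lose data: characters the target encoding cannot represent are replaced, typically with `?` unless a different fallback is configured.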

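The following article may give you a good grounding in what encodings are: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Here is a link to an MSDN article you can start with: Encoding Class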

score 2 · Accepted Answer

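StreamReader.ReadLine() tries to read the file as UTF-8 unless it is told otherwise. If that is not the encoding your file uses, StreamReader will not read the characters correctly.

This article describes the problem in detail and suggests passing System.Text.Encoding.Default to the constructor as the encoding.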
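A minimal sketch of that suggestion follows. Note that what `Encoding.Default` means depends on the runtime: on .NET Framework it is the system's ANSI code page, while on .NET Core and .NET 5+ it is UTF-8, so there it would not change the behavior. The `CopyLines` helper is hypothetical. The third constructor argument asks the reader to prefer a byte order mark, if the file has one, over the supplied encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class DefaultEncodingDemo
{
    // Copies a text file line by line, decoding with fallbackEncoding unless
    // the file starts with a byte order mark. Returns the encoding the reader
    // actually used. Hypothetical helper for illustration.
    public static Encoding CopyLines(string inputPath, string outputPath, Encoding fallbackEncoding)
    {
        using (var reader = new StreamReader(inputPath, fallbackEncoding, true))
        {
            string first = reader.ReadLine(); // BOM detection happens on the first read
            Encoding detected = reader.CurrentEncoding;
            using (var writer = new StreamWriter(outputPath, false, detected))
            {
                for (string line = first; line != null; line = reader.ReadLine())
                {
                    writer.WriteLine(line);
                }
            }
            return detected;
        }
    }

    static void Main()
    {
        // Assumes input.txt exists in the working directory.
        Encoding used = CopyLines("input.txt", "output.txt", Encoding.Default);
        Console.WriteLine(used.WebName);
    }
}
```

If the file has no byte order mark and was not written in the machine's default encoding, this still guesses wrong; naming the actual encoding explicitly is the more reliable option.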

score 0 · Accepted Answer

You could always create your own parser. What I use is:

        var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone();
        ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);

The first line creates a clone of the Windows-1252 encoding (the database I deal with works with Windows-1252; you would probably want to use UTF-8 or ASCII). The clone is needed because the instance returned by GetEncoding is read-only. The second line sets the encoder fallback so that, when encoding, any character with no equivalent in the target encoding is replaced with an empty string.

After this you would preferably filter out all control characters (excluding tabs, spaces, line feeds and carriage returns, depending on what you need).

Below is my personal encoding parser, which I set up to correct data entering our database.

        private string RetainOnlyPrintableCharacters(char c)
        {
            // Even if the character comes from a different code page altogether,
            // if the character exists in 1252 it will be returned in 1252 format.
            var ansiBytes = _ansiEncoding.GetBytes(new char[] {c});

            if (ansiBytes.Any())
            {
                if (ansiBytes.First().In(_printableCharacters))
                {
                    return _ansiEncoding.GetString(ansiBytes);
                }
            }
            return string.Empty;
        }

_ansiEncoding is the cloned encoding from above (var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone();) with the fallback value set.

If ansiBytes is not empty, an encoding exists for the character that was passed in, so it is compared against a list of all the printable characters; if it is in that list, it is an acceptable character and is returned.
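The snippet above depends on members that are not shown (`_ansiEncoding`, `_printableCharacters`, `In`). A self-contained sketch of the same idea follows; it is not the answerer's actual code. It substitutes ISO-8859-1 for Windows-1252 so that it runs without registering a code page provider on newer .NET runtimes, and the definition of "printable" in `IsPrintable` is an assumption.

```csharp
using System;
using System.Text;

static class PrintableFilter
{
    static readonly Encoding Narrow = CreateEncoding();

    static Encoding CreateEncoding()
    {
        // GetEncoding returns a read-only instance, so clone it before
        // changing the fallback. Unmappable characters then encode to nothing.
        var encoding = (Encoding)Encoding.GetEncoding("iso-8859-1").Clone();
        encoding.EncoderFallback = new EncoderReplacementFallback(string.Empty);
        return encoding;
    }

    // Assumption: tab, LF, CR and the non-control ranges count as printable.
    static bool IsPrintable(byte b)
    {
        if (b == 0x09 || b == 0x0A || b == 0x0D) return true;
        return (b >= 0x20 && b <= 0x7E) || b >= 0xA0;
    }

    // Keeps only characters that exist in the target encoding and are printable.
    public static string RetainOnlyPrintable(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            byte[] bytes = Narrow.GetBytes(new[] { c });
            if (bytes.Length == 1 && IsPrintable(bytes[0]))
            {
                result.Append(Narrow.GetString(bytes));
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        // BEL (U+0007) is a control character; U+20AC (euro sign) is not in ISO-8859-1.
        string cleaned = RetainOnlyPrintable("Auk\u00E9na\u0007 \u20AC");
        Console.WriteLine(cleaned == "Auk\u00E9na "); // True
    }
}
```

Note that this approach silently discards data, which suits cleaning input for a single-byte database column but not a general file copy.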
