delphi - Unable to use UTF-8 encoding

Question

I use this code to load a text file (my file is encoded as UTF-8), taken from "How to read a text file that contains 'NULL CHARACTER' in Delphi?":

uses
IOUtils;

var
  s: string;
  ss: TStringStream;
begin
  s := TFile.ReadAllText('c:\MyFile.txt');
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  ss := TStringStream.Create(s);

  try
    RichEdit1.Lines.LoadFromStream(ss, TEncoding.UTF8); //UTF8
  finally
    ss.Free;
  end;

end;

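But my problem is that RichEdit1 does not load the whole text. This is not because of the null characters; it is because of the encoding. When I run the application with this code instead, it loads the whole text: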

uses
IOUtils;

var
  s: string;
  ss: TStringStream;
begin
  s := TFile.ReadAllText('c:\MyFile.txt');
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  ss := TStringStream.Create(s);

  try
    RichEdit1.Lines.LoadFromStream(ss, TEncoding.Default);
  finally
    ss.Free;
  end;

end;

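I changed TEncoding.UTF8 to TEncoding.Default. The whole text is loaded, but it is not displayed correctly and is unreadable.

My guess is that the file contains some characters that UTF-8 does not support, so the loading process stops when it reaches such a character.

Please help. Is there any workaround?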

**Edit:**

I'm sure it's UTF-8 and it's plain text. It's an HTML source file. I'm sure it has null chars, as I saw them using Notepad++. The value of RichEdit1.PlainText is True.

score 14 · Accepted Answer

You should give the encoding to TFile.ReadAllText. After that you are working with Unicode strings only and don't have to bother with UTF8 in the RichEdit.

var
  s: string;
begin
  s := TFile.ReadAllText('c:\MyFile.txt', TEncoding.UTF8);
  // normally this shouldn't be necessary 
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  RichEdit1.Lines.Text := s;

end;

score 2 · Answer

Since you are loading an HTML file, you need to pre-parse the HTML and check if its <head> tag contains a <meta> tag specifying a specific charset. If it does, you must load the HTML using that charset, or else it will not decode to Unicode correctly.

If there is no charset specified in the HTML, you have to choose an appropriate charset, or ask the user. For instance, if you are downloading the HTML from a webserver, you can check if a charset is specified in the HTTP Content-Type header, and if so then save that charset with (or even in) the HTML so you can use it later. Otherwise, the default charset for downloaded HTML is usually ISO-8859-1 unless known otherwise.
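The charset lookup and fallback described above could be sketched roughly as follows. This is only a minimal sketch, not a real HTML parser: SniffMetaCharset, CodePageForCharset and LoadHtmlText are hypothetical helper names, only the first 1024 bytes are searched for a literal "charset=", and only two charset names are mapped to Windows code pages. Treating every other name as ISO-8859-1 is an assumption made for brevity.

```pascal
uses
  SysUtils, IOUtils;

// Hypothetical helper: returns the lower-cased charset name found after
// "charset=" near the start of the file, or '' when none is found.
function SniffMetaCharset(const Bytes: TBytes): string;
var
  Head: string;
  I, P, N: Integer;
begin
  Result := '';
  N := Length(Bytes);
  if N > 1024 then
    N := 1024;
  // Map each byte to one character so the ASCII markup can be searched
  // before the real encoding is known.
  SetLength(Head, N);
  for I := 0 to N - 1 do
    Head[I + 1] := Char(Bytes[I]);
  Head := LowerCase(Head);
  P := Pos('charset=', Head);
  if P = 0 then
    Exit;
  Inc(P, Length('charset='));
  // Skip an optional opening quote, as in <meta charset="utf-8">.
  if (P <= Length(Head)) and CharInSet(Head[P], ['"', '''']) then
    Inc(P);
  while (P <= Length(Head)) and
    CharInSet(Head[P], ['a'..'z', '0'..'9', '-', '_']) do
  begin
    Result := Result + Head[P];
    Inc(P);
  end;
end;

// Hypothetical mapping from a charset name to a Windows code page.
// Assumption: anything unrecognized (including '') is treated as ISO-8859-1.
function CodePageForCharset(const Charset: string): Integer;
begin
  if Charset = 'utf-8' then
    Result := 65001
  else if Charset = 'windows-1252' then
    Result := 1252
  else
    Result := 28591; // ISO-8859-1
end;

// Hypothetical loader: decodes the file once, using the sniffed charset.
function LoadHtmlText(const FileName: string): string;
var
  Bytes: TBytes;
  Enc: TEncoding;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  // GetEncoding returns a new instance that the caller must free.
  Enc := TEncoding.GetEncoding(CodePageForCharset(SniffMetaCharset(Bytes)));
  try
    Result := Enc.GetString(Bytes);
  finally
    Enc.Free;
  end;
end;
```

With that in place, something like RichEdit1.Lines.Text := StringReplace(LoadHtmlText('c:\MyFile.txt'), #0, '', [rfReplaceAll]); would decode the file once and then work with a Unicode string only. A byte order mark, if present, is not stripped by this sketch.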

The only time you should ever load HTML as UTF-8 is if you know for a fact that the HTML is actually UTF-8 encoded. You cannot just blindly assume the HTML is UTF-8 encoded, unless you are the one who created the HTML in the first place.
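If you are not sure, one way to rule UTF-8 out is to check whether the raw bytes form well-formed UTF-8 sequences at all. The sketch below uses a hypothetical helper, LooksLikeUtf8. It is a structural check only: it does not reject overlong forms or encoded surrogates. A False result means the file is definitely not UTF-8; a True result is only a hint, since pure ASCII and some legacy-encoded text will also pass.

```pascal
uses
  SysUtils;

// Hypothetical helper: True when every lead byte is followed by the
// expected number of continuation bytes ($80..$BF).
function LooksLikeUtf8(const Bytes: TBytes): Boolean;
var
  I, Extra: Integer;
begin
  Result := False;
  I := 0;
  while I < Length(Bytes) do
  begin
    if Bytes[I] < $80 then
      Extra := 0                      // single-byte (ASCII)
    else if (Bytes[I] and $E0) = $C0 then
      Extra := 1                      // lead byte of a 2-byte sequence
    else if (Bytes[I] and $F0) = $E0 then
      Extra := 2                      // lead byte of a 3-byte sequence
    else if (Bytes[I] and $F8) = $F0 then
      Extra := 3                      // lead byte of a 4-byte sequence
    else
      Exit;                           // stray continuation or invalid lead byte
    Inc(I);
    while Extra > 0 do
    begin
      // Relies on the default short-circuit Boolean evaluation.
      if (I >= Length(Bytes)) or ((Bytes[I] and $C0) <> $80) then
        Exit;                         // truncated or malformed sequence
      Inc(I);
      Dec(Extra);
    end;
  end;
  Result := True;
end;
```

You could call it on the result of TFile.ReadAllBytes before deciding whether to pass TEncoding.UTF8 to the decoder.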

From what you have described, it sounds like your HTML is not UTF-8. But it is hard to know for sure since you did not provide the file that you are trying to load.
