I'm generating texture atlases for rendering Unicode text in my app. The source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc.). I want to be able to generate all the symbols from each of these ANSI codepages.

Here is the outline of the code I would expect to have:

for I := 0 to 255 do
begin
  anChar := AnsiChar(I); //obtain AnsiChar

  //Apply codepage without converting the chars
  //<<--- this part does not work, showing:
  //"E2033 Types of actual and formal var parameters must be identical"
  SetCodePage(anChar, aCodepages[K], False);

  //Assign AnsiChar to UnicodeChar (automatic conversion)
  uniChar := anChar;

  //Here we get Unicode character index
  uniCode := Ord(uniChar);
end;

The code above does not compile (E2033), and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.

What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?

3 Answers

I would do it like this:

function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
  if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
    RaiseLastOSError;
end;
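
For example, plugged into the loop from the question (aCodepages and K are the question's own variables), it might be used like this:

for I := 0 to 255 do
begin
  uniChar := AnsiCharToWideChar(AnsiChar(I), aCodepages[K]);
  uniCode := Ord(uniChar);
end;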

I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
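
For illustration only, here is one way such a table could be prepared, building it at start-up with the function above rather than spelling the values out as a literal constant (the type and procedure names are mine, not from the answer):

type
  TAnsiToWideTable = array[AnsiChar] of WideChar;

procedure BuildAnsiToWideTable(CodePage: UINT; out Table: TAnsiToWideTable);
var
  c: AnsiChar;
begin
  //Convert each of the 256 byte values once; later lookups are a plain array index
  for c := Low(AnsiChar) to High(AnsiChar) do
    Table[c] := AnsiCharToWideChar(c, CodePage);
end;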

Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.

However, the assumption that you are making, and that this answer preserves, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single-byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean), etc. that are double-byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single-byte character sets.
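
If you need to guard against that at run time on Windows, a minimal sketch is to ask GetCPInfo (Winapi.Windows) whether the code page is single byte before using the one-character conversion above:

function IsSingleByteCodePage(CodePage: UINT): Boolean;
var
  Info: TCPInfo;
begin
  if not GetCPInfo(CodePage, Info) then
    RaiseLastOSError;
  //MaxCharSize = 1 means every character in this code page is one byte
  Result := Info.MaxCharSize = 1;
end;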

If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
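
A minimal cross-platform sketch along those lines, assuming the UnicodeFromLocaleChars declaration in System.SysUtils (it takes its arguments in the same order as MultiByteToWideChar); the function name is mine, not part of the RTL:

function AnsiCharToWideCharRTL(ac: AnsiChar; CodePage: UINT): WideChar;
begin
  if UnicodeFromLocaleChars(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
    raise EConvertError.CreateFmt('Cannot map byte %d in code page %d',
      [Ord(ac), CodePage]);
end;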

answered 2013-06-27T08:49:57.817

You can also do it in one step for all characters. Here is an example for codepage 1250:

var
  encoding: TEncoding;
  bytes: TBytes;
  unicode: TArray<Word>;
  I: Integer;
  S: string;
begin
  SetLength(bytes, 256);
  for I := 0 to 255 do
    bytes[I] := I;
  SetLength(unicode, 256);

  encoding := TEncoding.GetEncoding(1250); // change codepage as needed
  try
    S := encoding.GetString(bytes);
    for I := 0 to 255 do
      unicode[I] := Word(S[I+1]); // as long as strings are 1-based
  finally
    encoding.Free;
  end;
end;
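
Not part of the original answer, but the same idea is easy to wrap into a reusable function if you need a table for each of your code pages (the function name is hypothetical):

function CodePageToUnicodeTable(CodePage: Integer): TArray<Word>;
var
  encoding: TEncoding;
  bytes: TBytes;
  S: string;
  I: Integer;
begin
  //One byte value per slot: 0..255
  SetLength(bytes, 256);
  for I := 0 to 255 do
    bytes[I] := I;
  encoding := TEncoding.GetEncoding(CodePage);
  try
    S := encoding.GetString(bytes);
  finally
    encoding.Free;
  end;
  SetLength(Result, 256);
  for I := 0 to 255 do
    Result[I] := Word(S[I + 1]); // strings are 1-based here
end;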
answered 2013-06-27T08:52:45.470

Here is the code I have found to be working well:

var
  I: Byte;
  anChar: AnsiString;
  Tmp: RawByteString;
  uniChar: Char;
  uniCode: Word;
begin
  for I := 0 to 255 do
  begin
    anChar := AnsiChar(I);
    Tmp := anChar;
    //Re-tag the bytes with the target code page without converting them
    SetCodePage(Tmp, aCodepages[K], False);
    //The implicit conversion to UnicodeString now uses that code page
    uniChar := UnicodeString(Tmp)[1];
    uniCode := Word(uniChar);

    <...snip...>
  end;
end;
answered 2013-06-27T08:42:55.023