
UTF-8 and Unicode are being confused here. Percent-decoding %DB%81%DB%8C%D9%84%D9%88 gives you the bytes 0xDB 0x81 0xDB 0x8C 0xD9 0x84 0xD9 0x88. This appears to be a UTF-8 encoded string of Arabic characters. If you read it as UTF-8 and decode it into Unicode code points, you get:

  • 0xDB 0x81 → U+06C1 Arabic letter heh goal (ہ),
  • 0xDB 0x8C → U+06CC Arabic letter farsi yeh (ی),
  • 0xD9 0x84 → U+0644 Arabic letter lam (ل),
  • 0xD9 0x88 → U+0648 Arabic letter waw (و).
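
The mapping above follows from the two-byte UTF-8 layout (110xxxxx 10yyyyyy): the code point is the five low bits of the first byte followed by the six low bits of the second. A minimal sketch, where decode_utf8_2 is a hypothetical helper name and only two-byte sequences are handled:

```cpp
#include <cstdint>

// Hypothetical helper: decode one two-byte UTF-8 sequence (110xxxxx 10yyyyyy).
// Returns the code point, or -1 if the bytes do not form such a sequence.
// Overlong forms (lead byte 0xC0 or 0xC1) are not rejected in this sketch.
long decode_utf8_2(std::uint8_t b0, std::uint8_t b1)
{
  if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
    return -1;
  return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
}
```

For example, decode_utf8_2(0xDB, 0x81) yields 0x06C1 and decode_utf8_2(0xD9, 0x88) yields 0x0648.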

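This is not to be confused with what you are actually decoding into, where each byte is taken as a code point of its own:

  • U+00DB Latin capital letter U with circumflex (Û),
  • U+0081 (C1 control character),
  • U+0084 (C1 control character),
  • U+0088 (C1 control character),
  • U+008C (C1 control character),
  • U+00D9 Latin capital letter U with grave (Ù).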
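
To make the difference concrete: if every byte is taken as a code point and the result is then written out as UTF-8, each byte from 0x80 upward becomes two bytes, so the name doubles in length. The sketch below assumes this is how the wrong result arises (the question's code is not shown here); bytes_as_code_points_to_utf8 is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: treat every input byte as a code point U+0000..U+00FF
// and encode that code point as UTF-8. This reproduces the mistaken reading.
std::string bytes_as_code_points_to_utf8(const std::string &in)
{
  std::string out;
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);                  // ASCII stays one byte
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
      out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10yyyyyy
    }
  }
  return out;
}
```

Under that assumption the byte 0xDB comes out as 0xC3 0x9B (Û) instead of being kept as the first half of ہ.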

Furthermore, Unix filenames are actually composed of bytes, not characters, so it is up to you to choose how to display them.

The easiest approach in this case is probably to percent-decode into bytes and create your directory with those bytes as its name, without translating them into Unicode code points.

Here is an example that works:

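#include <sys/stat.h>
#include <cstdio>
#include <string>

// Value of each hexadecimal digit, indexed by character code.
// Non-hex characters and all entries not listed are 0.
const int hex_value[256] = {
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0,  // '0'..'9'
  0,10,11,12,13,14,15, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 'A'..'F'
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0,10,11,12,13,14,15                              // 'a'..'f'
};

int main()
{
  std::string str("%DB%81%DB%8C%D9%84%D9%88");

  // Percent-decode in place: s reads, t writes (t never overtakes s).
  size_t t = 0;
  for (size_t s = 0; s < str.size(); s++) {
    if (str[s] == '%' && s + 2 < str.size()) {
      // Index through unsigned char so a byte >= 0x80 cannot become a negative index.
      unsigned char hi = str[s+1];
      unsigned char lo = str[s+2];
      str[t] = static_cast<char>(hex_value[hi] * 16 + hex_value[lo]);
      s += 2;
    }
    else
      str[t] = str[s];
    t++;
  }
  str.resize(t);

  // Use the raw bytes as the directory name; no conversion to code points.
  if (mkdir(str.c_str(), 0755) != 0) {
    perror("mkdir");
    return 1;
  }
  return 0;
}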

If your terminal still shows something other than ہیلو, the terminal itself may be confused about which character set to use. To rule this out, pipe the output of ls through hexdump and make sure you see the bytes you expect:

$ ls | hexdump
0000000 61 2e 6f 75 74 0a 63 6c 65 61 6e 2e 70 68 70 0a
…
00000c0 32 32 2e 63 6f 6d 3a 32 32 0a db 81 db 8c d9 84
00000d0 d9 88 0a                                       
00000d3

Here you can clearly see the correct filename bytes db 81 db 8c d9 84 d9 88 at the end, followed by the newline 0a that ls prints after each name.
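
The same check can be made from inside the program before calling mkdir. A small sketch, where to_hex is a hypothetical helper that is not part of the example above:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a string as two lowercase hex
// digits, separated by spaces, similar to the hexdump output.
std::string to_hex(const std::string &bytes)
{
  std::string out;
  char buf[4];
  for (unsigned char c : bytes) {
    std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
    if (!out.empty())
      out += ' ';
    out += buf;
  }
  return out;
}
```

Applied to the correctly decoded name, this should give db 81 db 8c d9 84 d9 88. If it shows sixteen bytes starting with c3 9b, the bytes were converted to code points somewhere along the way.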

answered 2013-01-18T07:26:16.887