3 Answers

7

Here is a short presentation of the different actors:

  • ASCII is both a set of characters (there are 128 of them) and a code to represent them (on 7 bits).

  • Unicode is a set of characters (there are a lot more than 128).

  • UTF-8 is a code to represent Unicode characters.

  • Your terminal. It interprets the bytes output by your program as UTF-8 encoded characters and displays the corresponding Unicode characters.

  • OCaml processes sequences of bytes (OCaml uses the name char, but that is misleading; the name byte would be more appropriate).

So if OCaml outputs the sequence of bytes corresponding to the UTF-8 code for "你好", your terminal will interpret it as a UTF-8 string and will display 你好. But for OCaml, "你好" is just a sequence of 6 bytes.
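
To make the "6 bytes" point concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8, and the names greeting and bytes_of_string are only illustrative:

```ocaml
(* The literal is stored as raw UTF-8 bytes, assuming this source
   file itself is saved as UTF-8. *)
let greeting = "你好"

(* Byte values of a string, as a list of ints. *)
let bytes_of_string s =
  List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "%d bytes:" (String.length greeting);
  List.iter (Printf.printf " %d") (bytes_of_string greeting);
  print_newline ();
  (* The bytes are written unchanged; a terminal set to UTF-8
     renders them as the two characters. *)
  print_endline greeting
```

Each of the two characters occupies three bytes in UTF-8, which is where the length of 6 comes from.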

answered 2013-04-24T19:57:01.123
3

TörökEdwin told you everything you need to know, I think. UTF-8 is specifically designed as a way to store Unicode values (code points) in a series of 8-bit bytes, so that it works with code that is used to dealing with ASCII C strings. Since OCaml strings are a series of 8-bit bytes, there's no problem storing a UTF-8 value there. If the program you use to create your OCaml source handles UTF-8, then it will have no trouble creating a string containing a UTF-8 value. You don't need to do anything special to get that to happen. (As I said, I've done this many times myself.)

If you don't need to process the value, then the OCaml I/O functions can also write out such a value (or read one in), and if the encoding of your display is UTF-8 (which is what I use), it will display correctly. But most often you will need to process your values. If you change your code to (for example) just write out the length of the string, you might start to see why you would need a special library for handling UTF-8.
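
As a rough illustration of that length experiment, the sketch below compares String.length with a hand-written code point count. The helper utf8_length is hypothetical (it is not a standard library function) and assumes its input is valid UTF-8:

```ocaml
(* Count code points in a string assumed to hold valid UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts
   a new code point. *)
let utf8_length s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* "你好" written as explicit UTF-8 bytes, so the result does not
     depend on the encoding of this source file. *)
  let s = "\228\189\160\229\165\189" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (utf8_length s)
```

The two numbers differ as soon as the string contains anything outside ASCII, which is why indexing or slicing by byte position can cut a character in half.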

If you wonder why a certain Unicode string is represented as a certain series of bytes in the UTF-8 encoding, you just need to read up on UTF-8. The Wikipedia article (UTF-8) might be a reasonable place to start.

answered 2013-04-25T13:43:39.477
2

You need to use a UTF-8 library only if you want to convert between different encodings, to normalize Unicode, or to access individual code points.

OCaml treats strings as 8-bit binary values of a specified length, so you can use any encoding directly. For example, you can just assign the UTF-8 value directly to a variable:

# let foo = "こんにちは";;
val foo : string =
  "\227\129\147\227\130\147\227\129\171\227\129\161\227\129\175"
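
For the "access individual code points" case, here is a sketch that decodes those same bytes. It assumes OCaml 4.14 or later, where the standard library provides String.get_utf_8_uchar (on older versions a third-party UTF-8 library would be needed); the name code_points is only illustrative:

```ocaml
(* Decode a UTF-8 string into its list of code points.
   Assumes OCaml >= 4.14 for String.get_utf_8_uchar. Invalid bytes
   decode to U+FFFD rather than raising. *)
let code_points s =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      let cp = Uchar.to_int (Uchar.utf_decode_uchar d) in
      (* Advance by the number of bytes this decode consumed. *)
      go (i + Uchar.utf_decode_length d) (cp :: acc)
  in
  go 0 []

(* The same 15 bytes the toplevel printed for foo. *)
let foo = "\227\129\147\227\130\147\227\129\171\227\129\161\227\129\175"

let () =
  List.iter (Printf.printf "U+%04X ") (code_points foo);
  print_newline ()
```

The 15 bytes decode to five code points, one per hiragana character.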
answered 2013-04-24T16:16:52.897