python - String to unicode code point escpe sequences in Python

Question

Here is my problem... I have a "normal" String like :

Hello World

And unlike all the other subjects I have found, I WANT to print it as it's Unicode Codepoint Escape value !

The output I am looking for is something like this:

\u0015\u0123

If anyone has an idea :)

score 1 · Accepted Answer

You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII, any ASCII codepoints are encoded to the same bytes as ASCII would use. What you are printing is correct, that is UTF-8.

Use some non-ASCII codepoints to see the difference:

>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'

Python will just use the characters themselves when it shows you a bytes value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x.. escape code, or a single-character escape sequence if there is one (\n for newline).

From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:

>>> '\u0015\u0123'
'\x15ģ'

Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE) is a codepoint in the 0x00-0xFF range and is shown using the shorter \x.. escape notation.

To show only unicode escape sequences for your text, you need to process it character by character:

>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a

It is important to stress that this is not UTF-8, however.

score 0 · Accepted Answer

You can use ord to the encoded bytes into numbers and use string formatting you display their hex values.

>>> s = u'Hello World \u0664\u0662'
>>> print s
Hello World ٤٢
>>> print ''.join('\\x%02X' % ord(c) for c in s.encode('utf-8'))
\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x20\xD9\xA4\xD9\xA2

python - String to unicode code point escpe sequences in Python

2 回答 2

Related

Reference