You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII, any ASCII codepoints are encoded to the same bytes as ASCII would use. What you are printing is correct, that is UTF-8.
Use some non-ASCII codepoints to see the difference:
>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'
Python will just use the characters themselves when it shows you a bytes
value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x..
escape code, or a single-character escape sequence if there is one (\n
for newline).
From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:
>>> '\u0015\u0123'
'\x15ģ'
Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE
) is a codepoint in the 0x00
-0xFF
range and is shown using the shorter \x..
escape notation.
To show only unicode escape sequences for your text, you need to process it character by character:
>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a
It is important to stress that this is not UTF-8, however.