I am experiencing the following behavior in Python 2.7:
>>> a1 = u'\U0001f04f' #1
>>> a2 = u'\ud83c\udc4f' #2
>>> a1 == a2 #3
False
>>> a1.encode('utf8') == a2.encode('utf8') #4
True
>>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8') #5
True
>>> u'\ud83c\udc4f'.encode('utf8') #6
'\xf0\x9f\x81\x8f'
>>> u'\ud83c'.encode('utf8') #7
'\xed\xa0\xbc'
>>> u'\udc4f'.encode('utf8') #8
'\xed\xb1\x8f'
>>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be') #9
u'\U0001f04f'
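In case it matters, here is a minimal sketch of what I assume is going on with how the two literals are stored; the lengths and the sys.maxunicode value are my guesses for a "wide" (UCS-4) build, which seems consistent with #3 being False:

# Python 2.7 sketch; assumes a wide (UCS-4) build, consistent with #3 above
import sys

a1 = u'\U0001f04f'
a2 = u'\ud83c\udc4f'

# On a wide build I expect a1 to be stored as one code point and a2 as two.
print hex(sys.maxunicode)   # expecting 0x10ffff
print len(a1), len(a2)      # expecting 1 2
print a1 == a2              # False, as in #3
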
What is the explanation for this behavior? More specifically:
- I'd expect the two strings to be equal if statement #5 is true, yet #3 shows otherwise (see the round-trip sketch after this list).
- Encoding both code points together, as in statement #6, yields a result different from encoding them one by one in #7 and #8. It looks as if the two code points are treated as a single code point encoded in four bytes. But what if I actually want them treated as two separate code points (see the separate-encoding sketch after this list)?
- As you can see from #9, the code units in a2 (\ud83c and \udc4f) are actually a1 encoded as UTF-16-BE. But although they were specified as Unicode code points using \u escapes inside a Unicode string (!), Python still somehow ends up with equality in #5. How is that possible?
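Here is the round-trip sketch mentioned in the first point; the reprs in the comments are my expectation, not something I have verified beyond this session:

# Python 2.7 sketch of the round trip through UTF-8 (statements #4 and #5)
a1 = u'\U0001f04f'
a2 = u'\ud83c\udc4f'

# Both strings encode to the same four bytes (#4), so decoding those bytes
# back should give the same single-code-point string in both cases (#5),
# even though a1 and a2 themselves compare unequal (#3).
print repr(a1.encode('utf8').decode('utf8'))   # expecting u'\U0001f04f'
print repr(a2.encode('utf8').decode('utf8'))   # expecting u'\U0001f04f'
print a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8')  # True, as in #5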
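And this is the separate-encoding sketch from the second point, to make concrete what I mean by treating the code points separately; the six-byte result is just my expectation based on #7 and #8:

# Python 2.7 sketch: encode each code point of a2 on its own and concatenate
a2 = u'\ud83c\udc4f'

separate = ''.join(ch.encode('utf8') for ch in a2)
print repr(separate)                 # expecting '\xed\xa0\xbc\xed\xb1\x8f', i.e. #7 + #8
print repr(a2.encode('utf8'))        # '\xf0\x9f\x81\x8f', as in #6
print separate == a2.encode('utf8')  # expecting False
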
Nothing makes sense here! What's going on?