I am copying strings containing the word cafe (but with an accented e) from a JavaScript source file into a Python script, where I need to do some processing on the data and then output some JSON. I am having some trouble getting my head around the encoding/decoding details, though. This is perhaps best illustrated with an example:
$ python
>>> import urllib2, json
>>> the_name = "Tasty Caf%C3%E9"
>>> the_name
'Tasty Caf%C3%E9'
>>> the_name_unquoted = urllib2.unquote(the_name)
>>> the_name_unquoted
'Tasty Caf\xc3\xe9'
>>> json.dumps({'bla': the_name_unquoted})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I've spent some time trying to understand how encodings work, but clearly I'm not getting it. Exactly what encoding/format (is there other appropriate terminology here?) is the_name_unquoted in the example above, and what is it about it that UTF-8 cannot decode correctly?
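To narrow things down, the same failure can be reproduced on the raw bytes alone, without json involved. The following is only a minimal sketch, assuming Python 3, where urllib.parse.unquote_to_bytes returns the percent-decoded bytes. The helper name try_utf8 and the two comparison strings (%C3%A9 and %E9) are not from my data; they are assumptions added purely for contrast.

```python
from urllib.parse import unquote_to_bytes


def try_utf8(quoted):
    """Percent-decode to raw bytes, then attempt a strict UTF-8 decode.

    Hypothetical helper for illustration; returns (raw, text, error).
    """
    raw = unquote_to_bytes(quoted)
    try:
        return raw, raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return raw, None, exc


# The string from the session above: bytes 0xC3 followed by 0xE9.
original = try_utf8("Tasty Caf%C3%E9")

# Assumed comparison: 0xC3 0xA9 is the two-byte UTF-8 form of e-acute.
utf8_variant = try_utf8("Tasty Caf%C3%A9")

# Assumed comparison: a lone 0xE9 is e-acute in Latin-1 (ISO-8859-1).
latin1_raw = unquote_to_bytes("Tasty Caf%E9")

for label, (raw, text, error) in [("original", original), ("utf8_variant", utf8_variant)]:
    print(label, raw, text, error)
print("latin1", latin1_raw, latin1_raw.decode("latin-1"))
```

In this sketch the %C3%A9 variant decodes cleanly as UTF-8 and the %E9 variant decodes as Latin-1, while the original byte pair 0xC3 0xE9 fails at the same position reported in the traceback. That suggests the quoted string mixes pieces of the two encodings, though I have not confirmed where the mix-up happens upstream.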