I believe my issue is that Python does not handle the character encoding of a column in a MySQL table the way I expect:
| column | varchar(255) | latin1_swedish_ci | YES | | NULL | | select,insert,update,references | |
The above shows the definition of this column. It has type varchar(255) and collation latin1_swedish_ci, which means its character set is latin1.
Now when I try to process this data in Python, I get the following error:
dictionary = gs.corpora.Dictionary(tweets)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 50, in __init__
self.add_documents(documents)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 97, in add_documents
_ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in doc2bow
document = sorted(utils.to_utf8(token) for token in document)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/corpora/dictionary.py", line 121, in <genexpr>
document = sorted(utils.to_utf8(token) for token in document)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.9.1-py2.7.egg/gensim/utils.py", line 164, in any2utf8
return unicode(text, encoding, errors=errors).encode('utf8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
gs is the gensim topic modeling library. Judging by the traceback, gensim tries to decode each token as UTF-8, so I believe the problem is that the bytes coming from this latin1 column are not valid UTF-8.
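
To illustrate what I think is happening, here is a minimal sketch that reproduces the failure on the single byte from the traceback (0x96) without touching the database or gensim. It assumes the column really holds Windows-1252 text (which is what MySQL calls latin1), where 0x96 is an en dash. The helper name decode_token is hypothetical and not part of gensim or any MySQL driver.

```python
# -*- coding: utf-8 -*-
# The byte reported in the traceback. Assumption: it comes from a latin1
# (Windows-1252) column, where 0x96 encodes an en dash (U+2013).
raw_token = b"\x96"


def decode_token(raw, source_encoding="cp1252"):
    """Hypothetical helper: decode a raw byte string read from the database.

    source_encoding is an assumption about how the column is stored.
    """
    return raw.decode(source_encoding)


# Decoding the raw byte as UTF-8 fails, just like in the traceback,
# because 0x96 is not a valid UTF-8 start byte.
try:
    raw_token.decode("utf8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the assumed source encoding works, and the result can
# then be re-encoded as UTF-8.
decoded = decode_token(raw_token)
utf8_bytes = decoded.encode("utf8")
```

If that assumption holds, decoding every token this way before building the dictionary might avoid the error without altering the table, but I have not verified this against my data.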
- How can I change the character encoding (collation?) for this column in my database?
- Is there an alternative solution?
Thanks for all the help!