If you get a Unicode error, it is sometimes hard to find the root of the problem. Where does the string come from?
Is there a way to show the string (or at least the part that triggers the error)?
You can use this snippet:
    def re_raise_unicode_error_with_hint(exc):
        # Take up to 15 items on either side of the offending range.
        hint = exc.object[max(exc.start - 15, 0):min(exc.end + 15, len(exc.object))]
        # Re-raise the same exception type with the hint as its reason.
        raise exc.__class__(exc.encoding, exc.object, exc.start, exc.end, 'hint: %r' % hint)

    try:
        html = html.decode(encoding)
    except UnicodeError as exc:
        re_raise_unicode_error_with_hint(exc)
This way the error message shows up to 15 characters (bytes, when decoding) before and after the offending part of your string.
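To try the idea in isolation, here is a minimal self-contained sketch, assuming Python 3. The wrapper `decode_with_hint`, the `context` parameter, and the sample bytes are hypothetical names added for illustration. Unlike the snippet above, this variant keeps the original reason text and appends the hint to it, and it only catches `UnicodeDecodeError`, since that is the Unicode error `bytes.decode` raises.

```python
def re_raise_unicode_error_with_hint(exc, context=15):
    """Re-raise a UnicodeDecodeError or UnicodeEncodeError with nearby data in its reason.

    `context` is a hypothetical parameter: how many items to show on each side.
    """
    start = max(exc.start - context, 0)
    # Slicing clamps at the end of the object, so no upper bound check is needed.
    hint = exc.object[start:exc.end + context]
    reason = '%s (hint: %r)' % (exc.reason, hint)
    # Both error types accept (encoding, object, start, end, reason).
    raise exc.__class__(exc.encoding, exc.object, exc.start, exc.end, reason) from exc


def decode_with_hint(data, encoding='utf-8'):
    """Hypothetical wrapper: decode bytes, adding a hint to any decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        re_raise_unicode_error_with_hint(exc)


if __name__ == '__main__':
    # Made-up sample input containing one byte that is invalid in UTF-8.
    sample = b'<p>price: 10 \xff EUR, shipping included</p>'
    try:
        decode_with_hint(sample)
    except UnicodeDecodeError as exc:
        # The message now includes the bytes surrounding the bad byte.
        print(exc)
```

Searching the input (or your logs) for the bytes shown in the hint is usually the quickest way to find out where the broken string came from.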