I want to use the NLTK chunker for Tamil (an Indic language). However, the documentation says that it doesn't support Unicode because it uses the 'pre' module for regular expressions:
Unresolved Issues
If we use the re module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with unicode strings.
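For context, the standard re module on its own seems to cope with Tamil text. Below is a minimal sketch of what I mean, assuming Python 3 (where every str is Unicode). The helper name split_tokens and the sample sentence are my own illustrations, not part of NLTK, and this says nothing about how the chunker behaves internally.

```python
import re


def split_tokens(text):
    """Split text on whitespace using the standard re module.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    # \S+ matches runs of non-whitespace characters, so Tamil vowel signs
    # and other combining marks stay attached to their base letters.
    return re.findall(r"\S+", text)


# Made-up sample sentence in Tamil script.
sample = "நான் தமிழ் பேசுகிறேன்"
tokens = split_tokens(sample)

print(tokens)
# A literal Tamil word can also be searched for directly.
print(re.search("தமிழ்", sample) is not None)
```

So the limitation appears to lie in the 'pre' module specifically, not in regular expressions over Unicode text in general.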
Any suggestions for a workaround, or another way to accomplish this?