I want to use the NLTK chunker for Tamil (an Indic language). However, its documentation says that it doesn't support Unicode, because it uses the 'pre' module for regular expressions:
Unresolved Issues

If we use the re module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with unicode strings.
Any suggestions for a workaround, or another way to accomplish this?
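To make the goal concrete, here is a minimal sketch of the kind of chunking I have in mind. It assumes Python 3, where `str` is Unicode and the standard `re` module accepts it. The sample sentence, the tag names, and the helper `chunk_noun_phrases` are all made up for illustration, and this is not NLTK's chunker, just a tag-pattern match done by hand.

```python
import re

# Hypothetical tagged sentence: (Tamil token, POS tag).
# The tags are illustrative only, not from a real Tamil tagset.
tagged = [
    ("நான்", "PRP"),
    ("பெரிய", "JJ"),
    ("புத்தகம்", "NN"),
    ("படித்தேன்", "VB"),
]

# Tag pattern: any number of adjectives followed by one or more nouns.
NP_PATTERN = re.compile(r"(?:<JJ>)*(?:<NN>)+")


def chunk_noun_phrases(tagged_tokens):
    """Return a list of chunks, each a list of the tokens it covers."""
    # Encode the tag sequence as one string, e.g. "<PRP><JJ><NN><VB>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting
        # how many tags open before each offset.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged_tokens[start:end]])
    return chunks


print(chunk_noun_phrases(tagged))
```

Because the pattern is matched against the tag string rather than the words, the Tamil tokens only need to survive as ordinary Unicode strings; whether NLTK's own chunker can be made to behave this way is exactly what I am asking.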