We have some text in ISO-8859-15 that we want to tokenize. (ISO-8859-15 is ISO-8859-1 with the Euro sign and a few other accented characters; for more details, see ISO-8859-15.)
I am trying to get the parser to recognize all the characters. My text editors use UTF-8 natively, so to avoid hidden conversion problems I'm restricting all re2c
code to ASCII, for example:
LATIN_CAPITAL_LETTER_A_WITH_GRAVE = "\xc0" ;
LATIN_CAPITAL_LETTER_A_WITH_ACUTE = "\xc1" ;
LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX = "\xc2" ;
LATIN_CAPITAL_LETTER_A_WITH_TILDE = "\xc3" ;
...
Then:
UPPER = [A-Z] | LATIN_CAPITAL_LETTER_A_WITH_GRAVE
| LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX
| LATIN_CAPITAL_LETTER_AE
| LATIN_CAPITAL_LETTER_C_WITH_CEDILLA
| ...
WORD = UPPER LOWER* | LOWER+ ;
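To pin down what I expect WORD to match, here is a hand-written reference sketch in plain C. The helper names (is_upper_8859_15, is_lower_8859_15, match_word) are made up for this post, and the byte ranges are my own reading of the ISO-8859-15 code chart, not something taken from the re2c rules above. It reads the input as unsigned char, so bytes such as 0xC0 keep their value.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if byte c is an uppercase letter in ISO-8859-15.
   The ranges are an assumption based on the code chart. */
static int is_upper_8859_15(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') return 1;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return 1; /* 0xD7 is the multiplication sign */
    return c == 0xA6 || c == 0xB4 || c == 0xBC || c == 0xBE; /* S caron, Z caron, OE, Y diaeresis */
}

/* Hypothetical helper: 1 if byte c is a lowercase letter in ISO-8859-15. */
static int is_lower_8859_15(unsigned char c)
{
    if (c >= 'a' && c <= 'z') return 1;
    if (c >= 0xDF && c != 0xF7) return 1; /* 0xF7 is the division sign */
    return c == 0xA8 || c == 0xB8 || c == 0xBD; /* s caron, z caron, oe */
}

/* Length of the WORD (UPPER LOWER* | LOWER+) starting at s, or 0 if none. */
static size_t match_word(const unsigned char *s, size_t len)
{
    size_t i = 0;
    if (len == 0) return 0;
    if (is_upper_8859_15(s[0])) i = 1;               /* optional leading UPPER */
    while (i < len && is_lower_8859_15(s[i])) i++;   /* following LOWER bytes */
    return i;
}
```

One detail when writing test strings: a hex escape swallows any hex digits that follow it, so "\xc9" "bc" has to be written as two adjacent literals rather than "\xc9bc". That is only a cross-check; the re2c scanner is what I actually need to work.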
The generated code compiles without problems and runs fine on ASCII input, but it stalls whenever it hits one of these extended characters.
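One thing I can check on my side, though I have not confirmed it is the cause: whether the type the scanner reads input through (YYCTYPE in re2c) is a plain char. The sketch below assumes the generated code compares each input byte against values such as 0xC0; both function names are hypothetical. On a platform where char is signed, the byte "\xc0" reads back as a negative number and never equals 0xC0, while the same byte read as unsigned char does.

```c
#include <limits.h>

/* Compare the first input byte with 0xC0 through plain char.
   Where char is signed, the byte 0xC0 is seen as -64, so this is 0. */
static int matches_via_plain_char(const char *p)
{
    return *p == 0xC0;
}

/* Same comparison, but reading the byte as unsigned char (value 192). */
static int matches_via_unsigned_char(const char *p)
{
    return *(const unsigned char *)p == 0xC0;
}
```

If the first function returns 0 for "\xc0" on my machine, that would fit with no rule matching the extended bytes, but it is only a guess at this point.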
Has anyone seen this, and is there a way to fix it?
Thank you,
Yimin