regex - Using re2c with ISO-8859-x

Question

We have some text in ISO-8859-15 that we want to tokenize. (ISO-8859-15 is ISO-8859-1 with the Euro sign and other common accented characters; for more details see ISO-8859-15.)

I am trying to get the parser to recognize all the characters. The native character representation of the text editors I'm using is UTF-8, so to avoid hidden conversion problems, I'm restricting all re2c code to ASCII, e.g.:

LATIN_CAPITAL_LETTER_A_WITH_GRAVE      = "\xc0" ;
LATIN_CAPITAL_LETTER_A_WITH_ACUTE      = "\xc1" ;
LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX = "\xc2" ;
LATIN_CAPITAL_LETTER_A_WITH_TILDE      = "\xc3" ;
...

Then:

UPPER    = [A-Z] | LATIN_CAPITAL_LETTER_A_WITH_GRAVE
                 | LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX
                 | LATIN_CAPITAL_LETTER_AE
                 | LATIN_CAPITAL_LETTER_C_WITH_CEDILLA
                 | ...

WORD     = UPPER LOWER* | LOWER+ ;

It compiles no problem and runs great on ASCII, but stalls whenever it hits these extended characters.

Has anyone seen this, and is there a way to fix it?

Thank you,

Yimin

score 3 · Accepted Answer

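Yes, I have seen this. It comes down to signed versus unsigned comparisons for bytes ≥ 128.

There are two ways to fix it: use unsigned char as your default type, e.g. re2c:define:YYCTYPE = "unsigned char";, or pass -funsigned-char (if you are using gcc; other compilers have an equivalent) as a compiler flag. Use whichever interferes least with your existing code.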
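
To illustrate why the signedness matters, here is a minimal sketch in plain C, not actual re2c output. The helper names matches_signed and matches_unsigned are hypothetical, and the sketch assumes a typical platform where converting the byte 0xC0 to signed char yields -64. It shows that a byte ≥ 128 read through a signed type never compares equal to its code point, while the same byte read through unsigned char does:

```c
/* Hypothetical helpers, for illustration only (not generated by re2c). */

/* Reads the current byte through a signed type, as a scanner would when
 * YYCTYPE is a signed char. On typical platforms 0xC0 becomes -64 here. */
int matches_signed(const char *cursor, int code)
{
    signed char yych = (signed char)*cursor;
    return yych == code;   /* yych is promoted to int: -64 == 192 is false */
}

/* Reads the current byte through unsigned char, as with
 * YYCTYPE = "unsigned char" or -funsigned-char. 0xC0 stays 192. */
int matches_unsigned(const char *cursor, int code)
{
    unsigned char yych = (unsigned char)*cursor;
    return yych == code;   /* 192 == 192 is true */
}
```

With the input "\xc0" and the code 0xC0, matches_signed returns 0 and matches_unsigned returns 1; for plain ASCII input such as "A" both return 1, which fits the observation that the scanner works on ASCII and only fails on the extended characters.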
