c++ - UTF-8 character ranges in C++11 regular expressions

Question

score 36 · Accepted Answer

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the ~~droid~~ character class you are looking for.

The character class you are looking for is the one that includes:

any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.

The character class you specified is the one that includes:

any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.

Messy, isn't it? Do you see the problem?

This will not match "latin" characters (by which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.

It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.

If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.

If you instead add {1} it does not match because we are back to matching three "characters" against one.

Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.

Note that the regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".

The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.

And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?

R"(\p{Script=Han})"

Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.

So what should you do?

You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".

I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.
