There is a good explanation here:
http://www.regular-expressions.info/unicode.html
A few excerpts:
"Unfortunately, Java and .NET do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ as a substitute for \X+."
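To make the substitute from the quote concrete, here is a minimal sketch that counts graphemes with \P{M}\p{M}* in Java. The class name GraphemeDemo and the method countGraphemes are hypothetical names chosen for illustration. As far as I know, newer Java releases (9 and later) also accept \X directly, so treat this as a fallback rather than the only option.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeDemo {
    // One code point that is not a combining mark, followed by
    // zero or more combining marks: the substitute for \X.
    private static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");

    // Counts how many times the grapheme pattern matches in s.
    static int countGraphemes(String s) {
        Matcher m = GRAPHEME.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' followed by U+0300 COMBINING GRAVE ACCENT: two chars, one grapheme.
        String decomposed = "a\u0300";
        System.out.println(decomposed.length());        // 2
        System.out.println(countGraphemes(decomposed)); // 1
    }
}
```

Note that this approximation only groups a base character with its combining marks. It does not implement the full Unicode grapheme cluster rules.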
"In Java, the regex token \uFFFF
only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant."
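The difference described in the quote can be tried out with a small sketch like the one below. It assumes canonical equivalence is switched on with Pattern.CANON_EQ, which the quote implies. The class name CanonEqDemo and the helper matches are hypothetical. The last line prints the case the quote says should not match, and I have left its result unannotated because I have not verified it on every JDK version.

```java
import java.util.regex.Pattern;

public class CanonEqDemo {
    // Source text "\u00E0": the Java compiler inserts the character itself,
    // so the regex engine sees a literal a-grave.
    static final Pattern LITERAL = Pattern.compile("\u00E0", Pattern.CANON_EQ);

    // Source text "\\u00E0": the regex engine receives the six-character
    // escape token and resolves it to a single code point on its own.
    static final Pattern ESCAPED = Pattern.compile("\\u00E0", Pattern.CANON_EQ);

    // True if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0"; // single code point
        String decomposed = "a\u0300"; // 'a' plus combining grave accent

        System.out.println(matches(LITERAL, precomposed)); // true
        System.out.println(matches(LITERAL, decomposed));  // true
        System.out.println(matches(ESCAPED, precomposed)); // true
        // According to the quoted text, this one should be false.
        System.out.println(matches(ESCAPED, decomposed));
    }
}
```

If both forms need to match regardless of how the pattern was written, normalizing the input with java.text.Normalizer before matching is another option worth considering.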