java - Finding a simple pattern in a string unless escaped

Question

I have some code that looks for a simple bold markup

private Pattern bold = Pattern.compile("\\*[^\\*]*\\*");

If someone uses: this my *bolded* text - my pattern would find "bolded"

I now need a way to use a literal * outside the context of bolding, so I'd like to allow escaping.

E.g. this my \*non-bolded\* text - should not find any pattern.

Is there a simple way I can change my Regex to achieve this?

score 5 · Accepted Answer

You need a negative lookbehind here:

(?<!\\)\*[^*]+(?<!\\)\*

In a Java string, this gives (backslash galore):

"(?<!\\\\)\\*[^*]+(?<!\\\\)\\*"
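As a minimal sketch of how this pattern could be used, the following collects every match with a Matcher. The class name BoldFinder and the helper findBold are made up for illustration; the two inputs are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoldFinder {
    // Lookbehind version from the answer: neither asterisk may be preceded by a backslash.
    private static final Pattern BOLD = Pattern.compile("(?<!\\\\)\\*[^*]+(?<!\\\\)\\*");

    // Hypothetical helper: returns every bold span (asterisks included) found in the input.
    static List<String> findBold(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = BOLD.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findBold("this my *bolded* text"));         // [*bolded*]
        System.out.println(findBold("this my \\*non-bolded\\* text")); // []
    }
}
```

Note that m.group() includes the surrounding asterisks; wrapping [^*]+ in a capturing group would be one way to get only the inner text.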

Note: the star (*) has no special meaning within a character class, so there is no need to escape it there.

Note 2: (?<!...) is a negative lookbehind; it is an anchor, which means it finds a position but consumes no text. Literally, it can be translated as: "find a position where there is no preceding text matching regex ...". Other anchors are:

^: find a position where there is no available input before (i.e., can only match at the beginning of the input);
$: find a position where there is no available input after (i.e., can only match at the end of the input; in Java, by default, also just before a final line terminator);
(?=...): find a position where the following text matches regex ... (this is called a positive lookahead);
(?!...): find a position where the following text does not match regex ... (this is called a negative lookahead);
(?<=...): find a position where the preceding text matches regex ... (this is a positive lookbehind);
\<: find a position where the preceding input is either nothing or a character which is not a word character, and the following character is a word character (implementation dependent);
\>: find a position where the following input is either nothing or a character which is not a word character, and the preceding character is a word character (implementation dependent);
\b: either \< or \>.
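To illustrate that these anchors match a position without consuming text, here is a small sketch in Java. The class AnchorDemo and the helper firstMatch are hypothetical names, and the sample strings are invented for the demonstration. Only the lookarounds and \b are shown, since Java does not treat \< and \> as word boundaries.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: "foo" only when followed by "bar"; "bar" is not part of the match.
        System.out.println(firstMatch("foo(?=bar)", "foobar"));      // foo
        // Negative lookahead: "foo" only when NOT followed by "bar".
        System.out.println(firstMatch("foo(?!bar)", "foobar"));      // null
        // Positive lookbehind: digits only when preceded by '$'; the '$' is not part of the match.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42")); // 42
        // Word boundary: "cat" as a whole word, not as the tail of "concat".
        System.out.println(firstMatch("\\bcat\\b", "concat"));       // null
        System.out.println(firstMatch("\\bcat\\b", "a cat sat"));    // cat
    }
}
```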

Note 3: JavaScript regexes did not support lookbehinds before ES2018, and older engines still do not; neither JavaScript nor Java supports \< or \> (use \b instead). More information here.

Note 4: with some regex engines, it is possible to alter the meaning of ^ and $ to match positions at the beginning and end of each line instead; in Java, that is Pattern.MULTILINE; in Perl-like regex engines, that is /m.
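The effect of Pattern.MULTILINE can be seen in a short sketch like the one below. The class name MultilineDemo, the helper countMatches and the sample text are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: counts how many times the pattern matches in the input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line\nthird line";

        // Default mode: ^ matches only at the very beginning of the input.
        Pattern wholeInput = Pattern.compile("^\\w+");
        // MULTILINE mode: ^ also matches right after each line terminator.
        Pattern perLine = Pattern.compile("^\\w+", Pattern.MULTILINE);

        System.out.println(countMatches(wholeInput, text)); // 1
        System.out.println(countMatches(perLine, text));    // 3
    }
}
```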

score 3 · Accepted Answer

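This negative lookbehind based regex should work for your example (one caveat: the trailing (?<!\\) comes after the closing \*, so the character it inspects is always that asterisk, and an escaped closing asterisk is not rejected):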

(?<!\\)\*[^*]+\*(?<!\\)

Live Demo: http://www.rubular.com/r/sobKUrkTjP

When translated to Java it will become:

(?<!\\\\)\\*[^*]+\\*(?<!\\\\)
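To check how the position of the second lookbehind affects matching, the sketch below compares this variant with one where the lookbehind sits before the closing asterisk. The class LookbehindPlacement, the helper firstMatch and the test input with an escaped closing asterisk are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindPlacement {
    // Variant from this answer: the second lookbehind comes after the closing asterisk.
    static final Pattern AFTER = Pattern.compile("(?<!\\\\)\\*[^*]+\\*(?<!\\\\)");
    // Variant with the second lookbehind placed before the closing asterisk.
    static final Pattern BEFORE = Pattern.compile("(?<!\\\\)\\*[^*]+(?<!\\\\)\\*");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both variants agree on the plain case from the question.
        System.out.println(firstMatch(AFTER, "this my *bolded* text"));  // *bolded*
        System.out.println(firstMatch(BEFORE, "this my *bolded* text")); // *bolded*

        // The closing asterisk is escaped here, so no bold span is intended.
        String input = "*here\\* and more";
        // After the closing * is consumed, the lookbehind sees '*', never '\'.
        System.out.println(firstMatch(AFTER, input));  // *here\*
        // Before the closing *, the lookbehind sees the backslash and rejects it.
        System.out.println(firstMatch(BEFORE, input)); // null
    }
}
```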

score 1 · Accepted Answer

I think the two answers so far are very interesting, but not completely correct. They don't work when the bolded text contains an escaped asterisk (I assume this is practically the main reason to escape asterisks).

For example:

My *bold \*text* here, another *bold*, more \* and *here\* and \* end* more text

Should find three groups:

*bold \*text*

*bold*

*here\* and \* end*

With a little modification, we can do that, with this regular expression:

(?<!\\)\*([^*\\]|\\\*)+\*

It can be tested here: http://www.rubular.com/r/Jeml02HHYJ

Of course, in Java some more escaping is needed:

(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*
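As a sketch of how this could look in a complete program, the following runs the pattern over the example sentence above. The class name EscapedBoldFinder and the helper findBold are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedBoldFinder {
    // Opening * must not be escaped; the body is any run of characters that are
    // neither * nor \, or the two-character sequence \*; then a closing *.
    private static final Pattern BOLD =
            Pattern.compile("(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*");

    // Hypothetical helper: returns every bold span (asterisks included) found in the input.
    static List<String> findBold(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = BOLD.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "My *bold \\*text* here, another *bold*, "
                + "more \\* and *here\\* and \\* end* more text";
        // Prints the three groups, one per line:
        // *bold \*text*
        // *bold*
        // *here\* and \* end*
        for (String match : findBold(text)) {
            System.out.println(match);
        }
    }
}
```

Because the body alternative excludes a bare backslash, a backslash inside bold text is only accepted when it is immediately followed by an asterisk.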
