If I have a String which is delimited by a character, let's say this:

a-b-c

and I want to keep the delimiters, I can use look-behind and look-ahead to keep the delimiters themselves, like:

string.split("((?<=-)|(?=-))");

which results in

  • a
  • -
  • b
  • -
  • c
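For reference, a minimal Java sketch of that first split. The class and method names (KeepDelimiters, splitKeepingDashes) are made up for illustration; only String.split and the regex come from the text above.

```java
import java.util.Arrays;

class KeepDelimiters {
    // Splits at every zero-width position that has a '-' directly
    // before it (look-behind) or directly after it (look-ahead).
    static String[] splitKeepingDashes(String input) {
        return input.split("((?<=-)|(?=-))");
    }

    public static void main(String[] args) {
        // prints [a, -, b, -, c]
        System.out.println(Arrays.toString(splitKeepingDashes("a-b-c")));
    }
}
```

Because both alternatives are zero-width, nothing is consumed by the split, which is why the delimiters survive as their own elements.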

Now, if one of the delimiters is escaped, like this:

a-b\-c

and I want to honor the escape, I came up with a regex like this:

((?<=-(?!(?<=\\-))) | (?=-(?!(?<=\\-))))  

ergo

string.split("((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))");

Now, this works and results in:

  • a
  • -
  • b\-c

(I'd remove the backslash afterwards with string.replace("\\", ""); I haven't found a way to include that in the regex.)
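A minimal sketch of that two-step approach, split first and then strip the backslashes. The class and method names (EscapedSplit, splitAndUnescape) are hypothetical; the regex and the replace call are the ones shown above.

```java
import java.util.Arrays;

class EscapedSplit {
    // The regex from the question, written as a Java string literal.
    static final String REGEX = "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))";

    // Splits around unescaped dashes, then removes every backslash
    // from each resulting piece.
    static String[] splitAndUnescape(String input) {
        String[] parts = input.split(REGEX);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\\", "");
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [a, -, b-c]
        System.out.println(Arrays.toString(splitAndUnescape("a-b\\-c")));
    }
}
```

Note that replace("\\", "") removes all backslashes, not only those in front of a dash, so this only fits input where the backslash has no other meaning.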

My problem is one of understanding.
The way I understood it, the regex would be, in words,

split ((if '-' is before (unless ('\-' is before))) or (if '-' is after (unless ('\-' is before))))

Why shouldn't the last part be "unless \ is before"? If '-' is after, that means we're between '\' and '-', so only \ should be before, not \\-, but it doesn't work if I change the regex to reflect that like this:

((?<=-(?!(?<=\\-))) | (?=-(?!(?<=\\))))  

Result: a, -, b\, -c

What is the reason for this? Where is my error in reasoning?

2 Answers


While this does not really answer the question, it explains how lookarounds work.

Lookarounds are anchors: they do not consume text, they only find a position in the input text. Your regex can be written in a simpler way:

(?<=-)(?<!\\-)|(?=-)(?<!\\)

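This version uses three kinds of lookaround: positive lookbehind (?<=...), negative lookbehind (?<!...) and positive lookahead (?=...). Your original regex also uses the fourth kind, negative lookahead (?!...).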

Here is the regex broken down:

(?<=-)            # Find a position where what precedes is a dash
(?<!\\-)          # Find a position where what precedes is not \-
|                 # Or
(?=-)             # Find a position where what follows is a dash
(?<!\\)           # Find a position where what precedes is not a \

Note the term "position": anchors do not advance in the text at all.

Now, if we try to match this regex against a-b\-c:

# Step 1
# Input:    | a-b\-c|
# Position: |^      |
# Regex:    | (?<=-)(?<!\\-)|(?=-)(?<!\\)|
# Position: |^                           |
# No match, try other alternative
# Input:    | a-b\-c|
# Position: |^      |
# Regex:    |(?<=-)(?<!\\-)| (?=-)(?<!\\)|
# Position: |               ^            |
# No match, regex fails
# Advance one position in the input text and try again

# Step 2
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    | (?<=-)(?<!\\-)|(?=-)(?<!\\)|
# Position: |^                           |
# No match, try other alternative
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)| (?=-)(?<!\\)|
# Position: |               ^            |
# Match: a "-" follows
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)|(?=-) (?<!\\)|
# Position: |                    ^       |
# Match: what precedes is not a \
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)|(?=-)(?<!\\) |
# Position: |                           ^|
# Regex is satisfied
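The trace above can be checked with a small sketch that collects every position where the simplified regex matches. The class and method names (LookaroundPositions, matchPositions) are hypothetical; Pattern, Matcher, find() and start() are the standard java.util.regex API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundPositions {
    // Returns the start index of every match. For a regex made only of
    // lookarounds, each match is empty, so start() equals end().
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Simplified regex from above, escaped for a Java string literal.
        String regex = "(?<=-)(?<!\\\\-)|(?=-)(?<!\\\\)";
        // prints [1, 2]: just before and just after the unescaped dash
        System.out.println(matchPositions(regex, "a-b\\-c"));
    }
}
```

Positions 4 and 5, on either side of the escaped dash, are not reported, which is what lets b\-c stay in one piece when the same regex is used with split.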

Here is an alternative that uses neither split nor lookarounds:

[a-z]+(\\-[a-z]+)*|-

You can use this regex in a Pattern and iterate over the tokens with a Matcher:

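import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main
{
    public static void main(final String... args)
    {
        // A token is either letters with optional escaped dashes (\-), or a lone dash
        final Pattern pattern
            = Pattern.compile("[a-z]+(\\\\-[a-z]+)*|-");

        final Matcher m = pattern.matcher("a-b\\-c");
        // Prints a, -, and b\-c on separate lines
        while (m.find())
            System.out.println(m.group());
    }
}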
answered 2013-07-09T10:54:16.657

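Why shouldn't the last part be "unless \ is before"?

(?=-(?!(?<=\\-)))
    ^here

At that point the cursor is already after the -, so the test "\ is before" is always false there: the character directly before the current position is always the - that was just matched. The negative lookahead around it then always succeeds, which is why the escaped - still gets split. To see the backslash from that position, the lookbehind has to check for \- instead.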
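To see the difference concretely, here is a small sketch that splits with only the look-ahead half of each regex. The class, constant and method names (CursorAfterDash, CHECK_BACKSLASH_DASH, CHECK_BACKSLASH_ONLY, splitBefore) are made up for illustration.

```java
import java.util.Arrays;

class CursorAfterDash {
    // Look-ahead half of the working regex: after matching '-',
    // require that what precedes the cursor is not "\-".
    static final String CHECK_BACKSLASH_DASH = "(?=-(?!(?<=\\\\-)))";

    // Look-ahead half of the changed regex: after matching '-', require
    // that what precedes the cursor is not "\". The cursor sits after
    // the '-', so this condition holds every time.
    static final String CHECK_BACKSLASH_ONLY = "(?=-(?!(?<=\\\\)))";

    static String[] splitBefore(String regex, String input) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "a-b\\-c";
        // prints [a, -b\-c]
        System.out.println(Arrays.toString(splitBefore(CHECK_BACKSLASH_DASH, input)));
        // prints [a, -b\, -c]
        System.out.println(Arrays.toString(splitBefore(CHECK_BACKSLASH_ONLY, input)));
    }
}
```

The second variant splits in front of every dash, escaped or not, which matches the a, -, b\, -c result reported in the question.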


Perhaps a simpler regex would be (written here with Java string escaping):

(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)

  • (?<=(?<!\\\\)-) checks that we are right after a - that has no \ before it.
  • (?=(?<!\\\\)-) checks that we are right before a - that has no \ before it.
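A quick sketch of this regex used with split, assuming the same test strings as in the question. The class and method names (SimplerSplit, split) are hypothetical.

```java
import java.util.Arrays;

class SimplerSplit {
    // Both alternatives first check that the '-' is not preceded by a
    // backslash, then anchor just after or just before that '-'.
    static final String REGEX = "(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)";

    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // prints [a, -, b\-c]
        System.out.println(Arrays.toString(split("a-b\\-c")));
        // prints [a, -, b, -, c]
        System.out.println(Arrays.toString(split("a-b-c")));
    }
}
```

Putting the backslash check in front of the - inside each lookaround means the check runs while the cursor is still before the dash, so a plain "not preceded by \" is enough.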
answered 2013-07-09T10:54:40.267