regex - Matching a character before or after a word, but not both in regex

Question

Let's say I need to match a word, word, where there may be a period before word or after word, but not both. Then word, .word, and word. should be matched, but .word. should not be. How would I match this and capture what occurred before and after the word?

That was a simplified example that I'll need to extend to more complicated cases. For example, now the symbols . and ' may occur before or after the word, but they can only occur once. So for example, .word, 'word, word.', and .word' are just a few of the valid matches, but something like .'word.' shouldn't match, or even .'word'.

The above example is my main priority, but an added bonus would be the order in which the period and apostrophe are added. Thus '.word and .'word should both match. I think one way that should work for this is \.?'?|'?\.?word, but I was hoping for some way where the number of statements in the OR clause doesn't depend on the number of symbols.

score 0 · Accepted Answer

OK. It took a bit more time to correctly handle the cases where word appears at the start or end of the string.

 "(?:\.word(?:[^.]|$))|(?:(?:[^.]|^)word(?:[^.]|$))|(?:(?:[^.]|^)word\.)"

The same regexp with lookaheads and lookbehinds (tested in Python):

 "(?:\.word(?:(?!\.)|$))|(?:(?:(?<!\.)|^)word(?:(?!\.)|$))|(?:(?:(?<!\.)|^)word\.)"

Usage:

 re.findall(pattern, '.word. .word .word. word.')  # with the first pattern, returns ['.word ', ' word.']
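To make the difference between the two patterns concrete, here is a minimal Python sketch that runs both on the sample string. The names CONSUMING and LOOKAROUND are mine, not the answer's. The first pattern consumes the neighbouring character, so it ends up inside the match; the lookaround version only inspects it.

```python
import re

# Pattern 1: neighbouring characters are consumed, so they show up in the match.
CONSUMING = r"(?:\.word(?:[^.]|$))|(?:(?:[^.]|^)word(?:[^.]|$))|(?:(?:[^.]|^)word\.)"

# Pattern 2: the same logic with lookarounds, so neighbours are only inspected.
LOOKAROUND = (
    r"(?:\.word(?:(?!\.)|$))"
    r"|(?:(?:(?<!\.)|^)word(?:(?!\.)|$))"
    r"|(?:(?:(?<!\.)|^)word\.)"
)

text = ".word. .word .word. word."

consumed = re.findall(CONSUMING, text)
inspected = re.findall(LOOKAROUND, text)

print(consumed)   # ['.word ', ' word.']
print(inspected)  # ['.word', 'word.']
```

If the surrounding whitespace matters to you, the consuming form keeps it; otherwise the lookaround form is probably the more convenient one.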

score 0 · Accepted Answer

Which flavor? If it's JavaScript, this should work:

(?:^|[^\w.'])(?=[.']*(word\b))(?!'*\.'*\1'*\.)(?!\.*'\.*\1\.*')([.']*)\1([.']*)

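Explanation:

(?:^|[^\w.']) - ensures that word is not the end of a larger word, and prevents the regex from bypassing the leading delimiters (. or ') if they're there.
(?=[.']*(word\b)) - ensures that word is not the beginning of a longer word, and that it is preceded by nothing but your chosen delimiters. The word is not consumed at this point; it is only captured in group #1 so it can be used to anchor the next two lookaheads.
(?!'*\.'*\1'*\.) - still positioned before the leading delimiters (if any), this makes sure that if there is a . before the word, there is not one after it.
(?!\.*'\.*\1\.*') - this does the same for '.
([.']*)\1([.']*) - finally, go ahead and consume the word along with any leading or trailing delimiters, capturing those in groups #2 and #3.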
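As a rough check of the pieces described above, the following Python sketch assembles the pattern (using the (word\b) form of the lookahead) and prints what groups #2 and #3 capture. The helper name delimiters is hypothetical, and running it under Python's re module rather than JavaScript is my assumption; the constructs used are supported by both.

```python
import re

# The pattern from the answer, split into its parts, with the word kept literal.
PATTERN = re.compile(
    r"(?:^|[^\w.'])"        # start of string, or a non-word, non-delimiter char
    r"(?=[.']*(word\b))"    # capture the word (group 1) without consuming it
    r"(?!'*\.'*\1'*\.)"     # reject a dot on both sides
    r"(?!\.*'\.*\1\.*')"    # reject an apostrophe on both sides
    r"([.']*)\1([.']*)"     # group 2: leading delimiters, group 3: trailing ones
)

def delimiters(token):
    """Return (before, after) for an accepted token, or None if it is rejected."""
    m = PATTERN.search(token)
    return (m.group(2), m.group(3)) if m else None

for token in ["word", ".word", "word.'", ".word'", ".word.", ".'word'", ".'word'."]:
    print(repr(token), delimiters(token))
```

Note that each extra delimiter needs its own negative lookahead in this approach, so the pattern still grows with the number of symbols.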

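If the flavor you're using supports lookbehind, it probably won't help. Most flavors place severe restrictions on what can be matched inside a lookbehind, which makes it useless for this task. The JavaScript regex above is probably still your best option. However, this regex works in .NET and JGSoft, the only flavors I know of that support completely unrestricted lookbehinds: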

(?<=(?:\.(?<dot1>)|'(?<apos1>))*)\bword\b(?=(?:\.(?<dot2>)|'(?<apos2>))*)(?!\k<dot1>\k<dot2>|\k<apos1>\k<apos2>)

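Explanation:

(?<=(?:\.(?<dot1>)|'(?<apos1>))*) - scans backward over the delimiters. As each one is matched, the empty capturing group after it effectively marks that character as having been seen.
\bword\b - consumes the word.
(?=(?:\.(?<dot2>)|'(?<apos2>))*) - scans forward for more delimiters and marks them, just like the lookbehind does.
(?!\k<dot1>\k<dot2>|\k<apos1>\k<apos2>) - asserts that neither the dot nor the apostrophe appeared both before and after the word. A backreference to an empty group never consumes any characters; it only asserts that the group participated in the match.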

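After those two flavors, Java's lookbehind is probably the most flexible, but it is also notoriously buggy. I should be able to port this regex to Java by changing the first * to {0,2}, but it just throws a "no obvious maximum length" exception. Again, you are better off using the JavaScript-compatible regex above.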

score 0 · Accepted Answer

This JavaScript works for the good and bad values you provided.

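var func = function (str) {
    // Group 1: leading non-letters, group 2: trailing non-letters.
    var result = true, match, re = /^([^a-z]+)[a-z]+([^a-z]+)$/i;
    if (re.test(str)) {
        match = re.exec(str);
        // Build a character class from the leading symbols. This is fine for . and ',
        // but symbols such as ] or \ would need escaping first.
        re = new RegExp("[" + match[1] + "]");
        // The word is bad if any leading symbol also appears after it.
        result = !re.test(match[2]);
    }
    return result;
};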

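Here is a brief explanation. If the string contains non-letters both before and after the letters, the non-letters are extracted and tested against each other. The result of that test is negated to decide whether the word is good or bad.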

str = ".'word."
match[1] = ".'", word = "word", match[2] = "."
/[.']/.test(".") is true, so func returns false
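The same idea can be sketched in Python with a set intersection instead of a dynamically built character class, which sidesteps any escaping concerns. This is only an illustration under my own assumptions: the names TOKEN and is_ok are hypothetical, and I use * rather than + so that a token with symbols on one side only also fits the shape.

```python
import re

# Leading non-letters, the letters, trailing non-letters (either side may be empty).
TOKEN = re.compile(r"^([^a-z]*)([a-z]+)([^a-z]*)$", re.IGNORECASE)

def is_ok(token):
    """True unless some symbol appears both before and after the letters."""
    m = TOKEN.match(token)
    if m is None:
        # Like func, do not reject tokens that do not fit the expected shape.
        return True
    before, _word, after = m.groups()
    # The token is bad when the two sides share at least one symbol.
    return not (set(before) & set(after))

good = ["word", ".word", "word.", "'word", "word.'", ".word'"]
bad = [".word.", ".'word.'", ".'word'.", ".'word'"]
print([is_ok(t) for t in good])
print([is_ok(t) for t in bad])
```

Because the check is a set operation, the number of allowed symbols does not change the code at all.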

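The func function expects a single word (characters with no whitespace) as a string. If you want to check a sentence, split it on whitespace and then check each word. Something like this: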

var sentence = "What does .'words'. means?";
var words = sentence.split(/\s+/g);
var areWordsOk;
for (var i = 0, len = words.length; i < len; i++) {
    areWordsOk = func(words[i]);
    if (!areWordsOk) {
        throw new Error("bad word."); // error is thrown
    }
}

Here are my test cases. Live demo: http://jsfiddle.net/Tb68G/2 Here is the source of the test cases:

var func = function (str) {
    var result = true, match, re = /^([^a-z]+)[a-z]+([^a-z]+)$/i;
    if (re.test(str)) {
        match = re.exec(str);
        re = new RegExp("[" + match[1] + "]");
        result = !re.test(match[2]);
    }
    return result;
};
test("test good values", function () {
    var arr = [
        "word",
        ".word",
        "word.",
        ".word",
        "'word",
        "word.'",
        ".word'"
    ];
    var i = arr.length,
    str;
    while (i--) {
        str = arr[i];
        equal(func(str), true, str + " should be true.");
    }
});
test("test bad values", function () {
    var arr = [
        ".word.",
        ".'word.'",
        ".'word'.",
        ".'word'"
    ];
    var i = arr.length,
    str;
    while (i--) {
        str = arr[i];
        equal(func(str), false, str + " should be false.");
    }
});

score -1 · Accepted Answer

I think regex is a cool thing...
But sometimes you need to use other methods,
when you see such a horrible expression for such a simple thing...

I say: code it!

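    // Returns the index of the first occurrence of word that has ch directly
    // before it or directly after it, but not both; returns -1 if there is none.
    int findWord(string text, string word, char ch, int startIdx = 0)
    {
        while (startIdx < text.Length)
        {
            int indexOf = text.IndexOf(word, startIdx);
            if (indexOf < 0) return -1;

            char preChar = (char) 0;
            char postChar = (char) 0;

            if (indexOf > 0)
                preChar = text[indexOf - 1];

            if (indexOf < text.Length - word.Length)
                postChar = text[indexOf + word.Length];

            // XOR: exactly one side may carry the character.
            if ((preChar == ch) ^ (postChar == ch))
            {
                return indexOf;
            }
            // Resume one character later so that no later occurrence is skipped.
            startIdx = indexOf + 1;
        }
        return -1; // reached the end of the text without a match
    }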

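Not as simple, and more than one line :)
But it performs better, and it will still be understandable when you read it a month or two later.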
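For comparison, here is a short Python sketch of the same plain-code approach. The function name find_word is my own, and advancing by one character after each rejected candidate is my choice. Like the XOR test in the answer, it only reports an occurrence when exactly one side carries the character, so a bare word with no delimiter is not reported.

```python
def find_word(text, word, ch, start=0):
    """Index of the first occurrence of word with ch on exactly one side, else -1."""
    while True:
        idx = text.find(word, start)
        if idx < 0:
            return -1
        end = idx + len(word)
        # Use an empty string when there is no neighbouring character.
        pre = text[idx - 1] if idx > 0 else ""
        post = text[end] if end < len(text) else ""
        # Exclusive or: the character may sit before or after, but not both.
        if (pre == ch) != (post == ch):
            return idx
        # Move one character forward so that no later occurrence is skipped.
        start = idx + 1

print(find_word(".word. word.", "word", "."))
```

Supporting several symbols would mean calling this once per symbol, or extending the neighbour check to a set of characters.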
