python - Regex to match a keyword that occurs before another keyword

Question

I need to find the places where the word "test" is followed by "follow", with no other "test" in between.

Example:

test
word
word
word
test
test
word
word
follow
word
word
test

I only want this part:

test
word
word
word
test
**test**
**word**
**word**
**follow**
word
word
test

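However, I am not familiar enough with regular expressions to do this. Any suggestions would be great.

Edit: although the word test will occur many times, the word follow will only occur once in the string.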

score 2 · Accepted Answer

You need your regex to use a lookahead here.

test(?:\w|\s(?!test))+?follow

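(?:) is a non-capturing group. \w matches any word character (in ASCII mode, [a-zA-Z0-9_]). \s matches any whitespace (including newlines). \s(?!test) matches a whitespace character only when it is not followed by test (this is called a negative lookahead). The +? after the group just makes the match non-greedy.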
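As a quick check, the pattern can be run against the sample from the question. This is only a minimal sketch; the variable names sample, pattern and matches are mine and not part of the answer.

```python
import re

# Sample text from the question.
sample = "test\nword\nword\nword\ntest\ntest\nword\nword\nfollow\nword\nword\ntest\n"

# \s(?!test) refuses to step over whitespace that is directly followed by
# "test", so a match can never run across a second "test".
pattern = r"test(?:\w|\s(?!test))+?follow"

matches = re.findall(pattern, sample)
print(matches)  # ['test\nword\nword\nfollow']
```

Only the third test starts a match: the first two are each followed, sooner or later, by another test before follow is reached.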

Matches on a test input:

test
word
**test**
**word**
**follow**
word
test
**test**
**word**
**word**
**follow**
word
word
**test**
**word**
**follow**

The following regex also rules out any substring matches (such as the test inside protest and similar words).

(?<!\w)(test)\s(?!\1\s)(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)
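To see what the stricter version changes, here is a small comparison on a made-up input that contains the word testimony. The input and the names loose and strict are assumptions for illustration only. Because the strict pattern contains a capturing group, the whole match is read with m.group().

```python
import re

# Hypothetical input: "testimony" merely starts with "test".
text = "test\nword\ntestimony\nword\nfollow\n"

loose = r"test(?:\w|\s(?!test))+?follow"
strict = r"(?<!\w)(test)\s(?!\1\s)(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)"

loose_matches = [m.group() for m in re.finditer(loose, text)]
strict_matches = [m.group() for m in re.finditer(strict, text)]

print(loose_matches)   # ['testimony\nword\nfollow']
print(strict_matches)  # ['test\nword\ntestimony\nword\nfollow']
```

The loose pattern treats testimony as a new test and starts the match there; the strict one only accepts test as a whole word.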

score 1 · Accepted Answer

For simplicity, I personally would not use a regex here:

text = (
"""test
word
word
word
test
test
word
word
follow
word
word
test
"""
)

def find_patterns(text):
    curr = []
    for word in text.split('\n'):
        if word == 'test':
            curr = [word]  # start a new sequence (also resets an existing one)
        elif curr:  # a sequence has been started by 'test'
            curr.append(word)  # add to the current sequence
            if word == 'follow':  # end of sequence
                yield curr  # yield one result
                curr = []  # reset sequence

print(list(find_patterns(text)))

Output:

 [['test', 'word', 'word', 'follow']]

score 0 · Accepted Answer

Ravi's regex pattern gives wrong results in some cases.
Example:

import re
s = """test
word 1
word 2
word 3
test 
tutulululalalo
testimony
word A
word B
follow
word X
word Y
test
"""

pat = r'test(?:\w|\s(?!test))+?follow'
print(re.findall(pat, s))
#
#result : ['testimony\nword A\nword B\nfollow']

The pattern must be:

pat = (r'test(?=\s)'  r'(?:\w|\s(?!test(?=\s)))+?'  'follow')
print(re.findall(pat, s))
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

Besides, I don't see the point of the OR expression. This works:

pat = (r'(test(?=\s)'  r'(?:.(?!test(?=\s)))+?'  'follow)')
print(re.findall(pat, s, re.DOTALL))
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

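Finally, I prefer the following pattern because it verifies only once that there is no "test" between the starting "test" and the ending "follow", whereas '(?:\w|\s(?!test(?=\s)))+?' and '(?:.(?!test(?=\s)))+?' verify for every character that it is not followed by "test":

pat = (r'test(?=\s)'
       r'(?!.+?test(?=\s).*?follow)'
       '.+?'
       'follow')
print(re.findall(pat, s, re.DOTALL))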
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

.

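Edit 1

As Ravi Thapliyal pointed out, my last regex pattern

pat = (r'test(?=\s)'
       r'(?!.+?test(?=\s).*?follow)'
       '.+?'
       'follow')

is not perfect either.
I tried this pattern because I have never liked the pattern (?:.(?!something))+
My last regex pattern was meant to replace this disliked pattern.
Well, it doesn't work, and all my efforts were unsuccessful, although it seems to me that I did use such a pattern in the past, with some subtle additional part that made it work.
Alas, I didn't succeed, and I think I will give up the idea that it may work some day.
So I have decided to abandon my old idea and to accept for good that the pattern I dislike is the most obvious, understandable and easy to write.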
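One situation in which the single-lookahead pattern falls short can be reproduced as follows. This is an illustrative guess on a made-up string, not necessarily the case Ravi had in mind; the names one_shot and per_char are mine.

```python
import re

# Hypothetical input with two complete test...follow blocks.
s = "test\nword\nfollow\nword\ntest\nword\nfollow\n"

# Single negative lookahead evaluated once, right after the opening "test".
one_shot = r"test(?=\s)(?!.+?test(?=\s).*?follow).+?follow"
# Lookahead evaluated after every consumed character.
per_char = r"test(?=\s)(?:.(?!test(?=\s)))+?follow"

one_shot_matches = re.findall(one_shot, s, re.DOTALL)
per_char_matches = re.findall(per_char, s, re.DOTALL)

print(one_shot_matches)  # ['test\nword\nfollow']
print(per_char_matches)  # ['test\nword\nfollow', 'test\nword\nfollow']
```

With re.DOTALL the single lookahead scans to the end of the string, so the first block is rejected merely because a later test is eventually followed by a follow. The per-character check only looks at what lies between the current test and the nearest follow.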

.

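Now I must admit a second mistake: I saw that Ravi Thapliyal's regex pattern fails in some cases, but I did not consider all the possible failure cases.
It is easy to correct, though: instead of test(?=\s) with only a lookahead assertion, I should have written (?<=\s)test(?=\s) with both a lookbehind and a lookahead assertion.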
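The difference between the two assertions can be sketched on a made-up string that contains protest. The input and the names ahead_only and both_sides are assumptions for illustration. The extra first line is there because (?<=\s) needs a whitespace character before test, so a test at the very start of the string would not be matched.

```python
import re

# Hypothetical input: "protest" ends with "test" followed by a newline.
s = "intro\ntest\nword\nprotest\nword\nfollow\n"

ahead_only = r"test(?=\s)(?:.(?!test(?=\s)))+?follow"
both_sides = (r"(?<=\s)test(?=\s)"
              r"(?:(?!(?<=\s)test(?=\s)).)+?"
              r"(?<=\s)follow(?=\s)")

ahead_only_matches = re.findall(ahead_only, s, re.DOTALL)
both_sides_matches = re.findall(both_sides, s, re.DOTALL)

print(ahead_only_matches)  # ['test\nword\nfollow']
print(both_sides_matches)  # ['test\nword\nprotest\nword\nfollow']
```

The first result looks plausible but actually starts inside protest; with the lookbehind added, protest is treated as an ordinary word.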

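Ravi Thapliyal chose to write (?<!\w)(test)\s(?!\1\s), but this way of writing has some drawbacks:
- in an ordinary (non-raw) Python string it must be written (?!\\1\s), not (?!\1\s)
- it requires (test) to be defined in a capturing group, so the whole matches can't simply be listed with re.findall(); they have to be produced with the re.finditer() generator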
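Both drawbacks can be checked in a few lines. This is a small sketch on a made-up string; the names groups_only and whole_matches are mine.

```python
import re

# In an ordinary string literal '\1' is the control character chr(1);
# the regex backreference needs '\\1' or a raw string r'\1'.
assert "\1" == chr(1)
assert "\\1" == r"\1"

s = "test\nword\nfollow\n"
pat = r"(?<!\w)(test)\s(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)"

# With one capturing group, findall() returns the group, not the whole match.
groups_only = re.findall(pat, s)
whole_matches = [m.group() for m in re.finditer(pat, s)]

print(groups_only)    # ['test']
print(whole_matches)  # ['test\nword\nfollow']
```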

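He also wrote (?:\w|\s(?!\\1\s))*?. I don't see the point of complicating the pattern with an OR expression when (?:.(?!something))+ does the same job with a dot.

Moreover, and for me this is the biggest flaw, Ravi's regex pattern can't match strings that contain characters other than those represented by \w (and whitespace).

For all these reasons, I propose the following corrected solution: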

import re

s1 = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word
unfollow
word B
follow
word X
test
word Y
follow
"""

s2 = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow
"""

.

# eyquem's pattern
fu = r'(?<=\s)%s(?=\s)'
a = fu % 'test'
z = fu % 'follow'
pat = ('%s'
       '(?:(?!%s).)+?'
       '%s'
       % (a,a,z))

# Ravi's pattern (raw strings, so the backreference is written \1)
patRT = (r'(?<!\w)(test)\s'
         r'(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)')


for x in (s1,s2):
    print(x)
    print(re.findall(pat,x,re.DOTALL))
    print()
    print([m.group() for m in re.finditer(patRT,x)])
    print()

Result

test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word
unfollow
word B
follow
word X
test
word Y
follow

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

.

test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword ???????\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

['test\nword Y\nfollow']

.

Edit 2

To answer exactly the question asked:

import re

s = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow
"""
fu = r'(?<=\s)%s(?=\s)'
a,z = fu % 'test' ,  fu % 'follow'
pat = ('%s'
       '(?:(?!%s).)+?'
       '%s'
       % (a,a,z))

def ripl(m):
    # wrap every line of the matched block in ** markers
    return re.sub('(?m)^(.*)$','**\\1**',m.group())

print(re.sub(pat,ripl,s,flags=re.DOTALL))

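ripl() is the function that performs the replacement: it receives each match as a match object and returns the transformed text, which re.sub() then substitutes for the match.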
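Applied to the sample from the question, the same approach should reproduce the marked-up text the asker wanted. This is a self-contained sketch; the names sample and marked are mine.

```python
import re

# Sample text from the question.
sample = "test\nword\nword\nword\ntest\ntest\nword\nword\nfollow\nword\nword\ntest\n"

fu = r"(?<=\s)%s(?=\s)"
a, z = fu % "test", fu % "follow"
pat = "%s(?:(?!%s).)+?%s" % (a, a, z)

def ripl(m):
    # Wrap every line of the matched block in ** markers.
    return re.sub(r"(?m)^(.*)$", r"**\1**", m.group())

marked = re.sub(pat, ripl, sample, flags=re.DOTALL)
print(marked)
```

Only the lines from the third test down to follow come out wrapped in ** markers; every other line is left as it was.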
