
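I'm using re.sub() with some complex patterns (generated by code) that can lead to backtracking.

Is there any practical way to abort re.sub after a certain number of iterations in Python 2.6 (say, by pretending the pattern was not found, or by raising an error)?

Example (this is of course a silly pattern, but it is created dynamically by a complex text-processing engine):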

>>>re.sub('[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*','*','ilililililililililililililililililililililililililililililililililil :x')


2 Answers

4

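Other than analyzing the regex for catastrophic backtracking (a hard problem when the regex comes from outside) or using a different regex engine that does not backtrack, I think the only way is a timeout of this nature: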

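import re
import signal

class Timeout(Exception):
    """Raised by the SIGALRM handler when the time limit expires."""

def try_one(pat, rep, s, t=3):
    def timeout_handler(signum, frame):
        raise Timeout()

    # SIGALRM is Unix-only, and handlers can only be set from the main thread.
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(t)  # request SIGALRM after t whole seconds

    try:
        return re.sub(pat, rep, s)
    except Timeout:
        # '{0}'/'{1}' instead of '{}' so the message also formats on Python 2.6
        print('"{0}" timed out after {1} seconds'.format(pat, t))
        return None
    finally:
        signal.alarm(0)  # cancel a pending alarm, even if re.sub raised
        signal.signal(signal.SIGALRM, old_handler)

try_one(r'^(.+?)\1+$', r'\1', "a" * 1000000 + "b")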

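Trying to replace a huge run of a single repeated character (in this case, one million 'a' characters followed by a 'b') is a classic catastrophic regex failure. It would take tens of thousands of years to complete (at least with Python or Perl; awk is different).

After trying for 3 seconds, it times out gracefully and prints: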

"^(.+?)\1+$" timed out after 3 seconds
answered 2012-09-18T06:39:39.190
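
For the "pretend the pattern was not found" behaviour asked about in the question, the timeout approach above could be adapted roughly as follows. This is only a sketch under a few assumptions: the names RegexTimeout and sub_or_unchanged are made up for illustration, it needs a Unix platform and the main thread, and it uses signal.setitimer (available since Python 2.6) so the limit can be shorter than one second.

```python
import re
import signal


class RegexTimeout(Exception):
    """Hypothetical exception raised when the time limit expires."""


def sub_or_unchanged(pat, rep, s, seconds=0.5):
    """Like re.sub, but return s unchanged if matching exceeds the limit."""
    def on_alarm(signum, frame):
        raise RegexTimeout()

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # setitimer accepts a float, so limits below one second are possible.
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return re.sub(pat, rep, s)
    except RegexTimeout:
        return s  # behave as if the pattern had not matched
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)
```

A caller that would rather get an error than a silently unchanged string could let RegexTimeout propagate instead of catching it.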
-1

count can help you here:

In [9]: re.sub ?
Type:       function
Base Class: <type 'function'>
String Form:<function sub at 0x00AC7CF0>
Namespace:  Interactive
File:       c:\python27\lib\re.py
Definition: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the match object and must return
a replacement string to be used.


In [13]: a = "bbbbbbb"

In [14]: x = re.sub('b', 'a', a, count=3)

In [15]: x
Out[15]: 'aaabbbb'
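
One thing worth keeping in mind: count caps how many replacements are made, and as far as I can tell it does not bound the work the engine does while searching for each match. A small sketch using re.subn, which also reports the number of substitutions performed (the second pattern and input are made up for illustration):

```python
import re

# count stops re.subn after three successful replacements...
limited, n_limited = re.subn('b', 'a', 'bbbbbbb', count=3)

# ...but when nothing ever matches, no replacement is made, so count
# never comes into play and the full (backtracking) search still runs.
s = 'a' * 18 + 'b'
unchanged, n_none = re.subn(r'(a+)+$', '*', s, count=1)
```

Here n_limited is 3, while n_none is 0 because the second pattern never matches.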
answered 2012-09-18T06:02:43.697