python - Python regex catastrophic backtracking

Question

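I am searching for some phrases in XML files generated by MS Word. The problem is that any phrase can be interrupted by XML tags, which can appear between words or even inside a word, as you can see in this example: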

</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>

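So my approach was to simply reduce every XML tag to a cluster of # characters of the same length, so that when I search for a phrase, the regex ignores all the XML tags between any two characters.

What I basically need is the span of the phrase within the actual XML document, because I will use this span to process the XML document afterwards. I cannot work on a clone.

This approach works remarkably well, but some phrases cause catastrophic backtracking, such as the following example. So I need someone to point out where the backtracking comes from, or to suggest a better solution to the problem.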

=================================

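Here is an example:

I have this text, which contains some clusters of # characters (that I want to keep), and in which the spaces are also unpredictable, such as the following:

Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected
accomplishment (c)#######

In order to match the following phrase:

Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

I came up with this regex to accommodate the unpredictable # and space characters: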

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

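It works fine with all the other phrases I want to match, but this one has a problem that leads to catastrophic backtracking. Can anyone spot it?

The original text is separated by XML tags, so to make the regex simpler I replaced the tags with these # clusters. Here is the original text: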

</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>

score 3 · Accepted Answer

Since the situation is that complex, don't use a regex; just go through your line symbol by symbol:

etalone = "String to find"
etalone_length = len(etalone)
counter = 0
for symbol in your_line:
    if symbol == etalone[counter]:
        counter += 1
        if counter == etalone_length:
            print("String matches")
            break
    elif symbol != " " and symbol != "#":
        # Bad char found
        print("Does not match!")
        break
else:  # 'for' ran to the end without a break: etalone was never fully matched
    print("Does not match!")

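I have just realized that the approach above won't actually work if the first symbol we match is not the one we are looking for. How about this instead:

Clone your string
Remove the "#" characters from the clone
Match against the pattern
If the pattern matches, get the position of the match
From that position, find which exact occurrence of the first symbol was matched. For example, if the whole line is a#b##ca#d#f and the line we are looking for is adf, then matching starts at the second a symbol.
Find the nth occurrence of the a symbol in the original line. Set counter =
Use the algorithm above (storing the span start, and the counter just before break as the span end)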
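The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: instead of counting the nth occurrence of the first symbol, it records the original index of every character that survives the cleaning, which gives the same mapping back to the original line. The name find_span and the half-open (start, end) convention are assumptions made for this sketch.

```python
def find_span(phrase, line, marker='#'):
    """Return the (start, end) span of `phrase` in `line`, ignoring `marker`.

    `end` is exclusive, so line[start:end] is the matched stretch including
    any marker characters inside it. Returns None when there is no match.
    """
    if not phrase:
        return None
    # positions[i] is the index in `line` of the i-th non-marker character.
    positions = [i for i, ch in enumerate(line) if ch != marker]
    clean = ''.join(line[i] for i in positions)

    start = clean.find(phrase)
    if start == -1:
        return None
    last = start + len(phrase) - 1
    return positions[start], positions[last] + 1


line = 'a#b##ca#d#f'
span = find_span('adf', line)
print(span, line[span[0]:span[1]])   # (6, 11) a#d#f
```

The index list costs one integer per kept character, which is usually acceptable for a single document and avoids any backtracking.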

score 3 · Accepted Answer

If I understand the problem correctly, here is a way to tackle it without resorting to a pathological regex or character-by-character parsing:

import re

def do_it(search, text_orig, verbose = False):
    # A copy of the text without any "#" markers.
    text_clean = text_orig.replace('#', '')

    # Start position of search text in the cleaned text.
    try:               i = text_clean.index(search)
    except ValueError: return [None, None]

    # Collect the widths of the runs of markers and non-markers.
    rgx    = re.compile(r'#+|[^#]+')
    widths = [len(m.group()) for m in rgx.finditer(text_orig)]

    # From that data, we can compute the span.
    return compute_span(i, len(search), widths, text_orig[0] == '#')
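To make the widths data concrete, here is a small stand-alone sketch of what the `#+|[^#]+` regex produces. The helper name run_widths is hypothetical and only mirrors the two values that do_it passes on: the list of run widths and whether the text starts with a marker.

```python
import re

RUNS = re.compile(r'#+|[^#]+')

def run_widths(text):
    """Return (widths, starts_with_marker) for alternating '#' / non-'#' runs."""
    widths = [len(m.group()) for m in RUNS.finditer(text)]
    return widths, text.startswith('#')


print(run_widths('##foobar###'))   # ([2, 6, 3], True)
print(run_widths('a##xf#oo'))      # ([1, 2, 2, 1, 2], False)
```

Because the two alternatives of the regex can never match the same character, the runs strictly alternate, so a single boolean is enough to know which kind each width belongs to.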

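Here is a fairly straightforward way to compute the span from the widths data. My first attempt was incorrect, as eyquem pointed out. The second attempt was correct but complex. This third approach seems both simple and correct.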

def compute_span(span_start, search_width, widths, is_marker):
    span_end       = span_start + search_width - 1
    to_consume     = span_start + search_width
    start_is_fixed = False

    for w in widths:
        if is_marker:
            # Shift start and end rightward.
            span_start += (0 if start_is_fixed else w)
            span_end   += w
        else:
            # Reduce amount of non-marker text we need to consume.
            # As that amount gets smaller, we'll first fix the
            # location of the span_start, and then stop.
            to_consume -= w
            if to_consume < search_width:
                start_is_fixed = True
                if to_consume <= 0: break
        # Toggle the flag.
        is_marker = not is_marker

    return [span_start, span_end]

And a battery of tests to keep the critics at bay:

def main():
    tests = [
        #                0123456789012345678901234567890123456789
        ( [None, None], '' ),
        ( [ 0,  5],     'foobar' ),
        ( [ 0,  5],     'foobar###' ),
        ( [ 3,  8],     '###foobar' ),
        ( [ 2,  7],     '##foobar###' ),
        ( [25, 34],     'BLAH ##BLAH fo####o##ba##foo###b#ar' ),
        ( [12, 26],     'BLAH ##BLAH fo####o##ba###r## BL##AH' ),
        ( [None, None], 'jkh##jh#f' ),
        ( [ 1, 12],     '#f#oo##ba###r##' ),
        ( [ 4, 15],     'a##xf#oo##ba###r##' ),
        ( [ 4, 15],     'ax##f#oo##ba###r##' ),
        ( [ 7, 18],     'ab###xyf#oo##ba###r##' ),
        ( [ 7, 18],     'abx###yf#oo##ba###r##' ),
        ( [ 7, 18],     'abxy###f#oo##ba###r##' ),
        ( [ 8, 19],     'iji#hkh#f#oo##ba###r##' ),
        ( [ 8, 19],     'mn##pps#f#oo##ba###r##' ),
        ( [12, 23],     'mn##pab###xyf#oo##ba###r##' ),
        ( [12, 23],     'lmn#pab###xyf#oo##ba###r##' ),
        ( [ 0, 12],     'fo##o##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 0, 12],     'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),
        ( [ 0, 12],     'f##oo##ba###r## aaaaaBLfoob##ar' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##arAH' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##ar#AH' ),
        ( [ 0, 12],     'foo##ba#####r## aaaaBL#foob##ar' ),
        ( [ 1, 12],     '#f#oo##ba###r## aaaBL##foob##arAH' ),
        ( [ 1, 12],     '#foo##ba####r## aaaBL##foob##ar#AH' ),
        ( [ 2, 12],     '#af#oo##ba##r## aaaBL##foob##ar' ),
        ( [ 3, 13],     '##afoo##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 5, 17],     'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),
        ( [ 5, 17],     'BLAH#fo##o##ba###r aaBLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r###BLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r#BL##foob##ar' ),
    ]
    for exp, t in tests:
        span = do_it('foobar', t, verbose = True)
        if exp != span:
            print('\n0123456789012345678901234567890123456789')
            print(t)
            print(dict(got = span, exp = exp))

main()

score 1 · Accepted Answer

Another, simpler solution is to remove the pound signs with

your_string.replace('#', '')

and test your regex (without all the #*) against the string returned by replace.

score 1 · Accepted Answer

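In a previous answer, I used the re and difflib modules, together with your principle of replacing every tag with a character.
But I realized that your problem can be solved using re alone, without any substitution by an arbitrary character.

Import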

import re

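Data

I use tuples so that the data can be displayed in a more readable form during execution.

Note that I slightly modified the data to avoid some problems: only one blank between framework and period, a capital P in Programme 7 in both strings, etc.

Note also that I added a sequence of characters ### in phrase and xmltext (in front of the date 2014-2015) to show that my code still works in this case. Other answers do not handle this possibility.

Phrase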

tu_phrase = ('Relationship to the ',
             'strategic framework ',
             'for the period ###2014-2015',
             ': Programme 7, Economic and Social Affairs, ',
             'subprogramme 3, expected accomplishment (c)')
phrase = ''.join(tu_phrase)

XML text

tu_xmltext = ('EEEEEEE',
              '<w:rPr>',
              'AAAAAAA',
              '</w:rPr><w:t>',
              'Relationship to the ',
              '</w:t></w:r><w:r>',
              '<w:rPr><w:i/>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              'strategic framework ',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076">',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              'for the period ###2014-2015',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              ': Programme 7, Economic and Social Affairs, ',
              'subprogramme 3, expected accomplishment (c)',
              '</w:t>',
              '321354641331')
xmltext = ''.join(tu_xmltext)

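Working functions

The function olding_the_new(stuvw, pat_for_sub) returns a list of triples (pmod, w, pori) expressing the correspondence between the positions of the common sequences in stuvw and in re.sub(pat_for_sub, '', stuvw).
These sequences are the ones in stuvw that are not consumed by pat_for_sub (the code captures them with group(1)):
- (pmod, w) describes a sequence in re.sub(pat_for_sub, '', stuvw)
- pmod is its position in re.sub(pat_for_sub, '', stuvw)
- w is its width [it is the same in re.sub(pat_for_sub, '', stuvw) and in stuvw]
- pori is the position of this sequence in the original stuvw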

def olding_the_new(stuvw,pat_for_sub):
    triples = []
    pmod = 0 # pmod = position in modified stuvw,
             # that is to say in re.sub(pat_for_sub,'',stuvw)
    for mat in re.finditer(r'{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),
                           stuvw):
        if mat.group(1):
            triples.append((pmod,mat.end()-mat.start(),mat.start()))
            pmod += mat.end()-mat.start()
    return triples


def finding(LITTLE,BIG,pat_for_sub,
            olding_the_new=olding_the_new):
    triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)
    modBIG = re.sub(pat_for_sub,'',BIG)
    modLITTLE = re.escape(LITTLE)
    for mat in re.finditer(modLITTLE,modBIG):
        st,nd = mat.span() # in modBIG
        sori = -1 # start original, id est in BIG
        for tr in triples:
            if st < tr[0]+tr[1] and sori<0:
                sori = tr[2] + st - tr[0] 
            if nd<=tr[0]+tr[1]:
                yield(sori, tr[2] + nd - tr[0])
                break
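As an illustration of what the triples look like, here is an independent sketch that builds the same kind of (pmod, w, pori) list by walking over the matches of the tag pattern. It is not the function above; the name kept_runs is an assumption, and it expects a pattern that never matches the empty string.

```python
import re

def kept_runs(text, pat):
    """List (pmod, w, pori) for every run of `text` that `pat` does not match.

    pmod: position of the run once all matches of `pat` are removed
    w:    width of the run (identical in both strings)
    pori: position of the run in the original `text`
    """
    triples = []
    pmod = 0   # running position in the stripped text
    last = 0   # end of the previous match in the original text
    for m in re.finditer(pat, text):
        if m.start() > last:
            w = m.start() - last
            triples.append((pmod, w, last))
            pmod += w
        last = m.end()
    if last < len(text):
        triples.append((pmod, len(text) - last, last))
    return triples


print(kept_runs('ab##cd#e', '#+'))   # [(0, 2, 0), (2, 2, 4), (4, 1, 7)]
```

Reading the middle triple: the run 'cd' sits at position 2 of the stripped text 'abcde', is 2 characters wide, and starts at position 4 of the original string.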

Execution

if __name__ == '__main__':
    print ('---------- phrase ----------\n%s\n'
           '\n------- phrase written in a readable form --------\n'
           '%s\n\n\n'
           '---------- xmltext ----------\n%s\n'
           '\n------- xmltext written in a readable form --------\n'
           '%s\n\n\n'
           %
           (phrase  , '\n'.join(tu_phrase),
            xmltext , '\n'.join(tu_xmltext))    )

    print ('*********************************************************\n'
           '********** Searching for phrase in xmltext **************\n'
           '*********************************************************')

    spans = list(finding(phrase,xmltext,'</?w:[^>]*>'))
    if spans:
        for s,e in spans:
            print ("\nspan in string 'xmltext' :  (%d , %d)\n\n"
                   'xmltext[%d:%d] :\n%s'
                   % (s,e,s,e,xmltext[s:e]))
    else:
        print ("-::: The first string isn't in second string :::-")

Result

*********************************************************
********** Searching for phrase in xmltext **************
*********************************************************

span in string 'xmltext' :  (34 , 448)

xmltext[34:448] :
Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>for the period ###2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

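Nota bene

My code cannot detect a phrase in an XML document when the sequences of whitespace between two words are not exactly the same in the phrase and in the XML text.
I tried to obtain this possibility, but it is too complicated.
In your example, in the XML sequence you show, there is a blank between strategic framework and the following tags, and another blank between these tags and the following for the period. Under this condition my code could not work (I doubt that the other answers can do better on this point), so I used an xmltext without a blank in front of for the period.

On the other hand, my code does not use a replacement character, so any character may appear in the XML document and in the phrase; there is no risk of trouble from a character that is present in them while also being used as the replacement character.

My code gives the span directly in the original XML document, not in an intermediate text modified with a replacement character.

It gives all the occurrences of phrase in the XML document, not only the first one.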

...................................

With the following data:

print ('\n*********************************************************\n'
       "********* Searching for 'foobar' in samples *************\n"
       '*********************************************************')

for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',
               '#fo##o##ba###r## aaaaaBLfoob##arAH',
               'BLAHHfo##o##ba###r   BLfoob##arAH',
               'BLAH#fo##o##ba###rBLUHYfoob##arAH',
               'BLA# fo##o##ba###rBLyyyfoob##ar',
               'BLA# fo##o##ba###rBLy##foob##ar',
               'kjhfqshqsk'):
    spans = list(finding('foobar',xample,'#'))
    if spans:
        print ('\n%s\n%s'
               %
               (xample,
                '\n'.join('%s  %s'
                          % (sp,xample[sp[0]:sp[1]])
                          for sp in spans))
               )
    else:
        print ("\n%s\n-::: Not found :::-" % xample)

the result is:

*********************************************************
********* Searching for 'foobar' in samples *************
*********************************************************

fo##o##ba###r## aaaaaBLfoob##arAH
(0, 13)  fo##o##ba###r
(23, 31)  foob##ar

#fo##o##ba###r## aaaaaBLfoob##arAH
(1, 14)  fo##o##ba###r
(24, 32)  foob##ar

BLAHHfo##o##ba###r   BLfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLAH#fo##o##ba###rBLUHYfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLyyyfoob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLy##foob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

kjhfqshqsk
-::: Not found :::-

………………………………………………………………………………………………………………………………………………

With the following code, I examined your question:

import urllib

sock = urllib.urlopen('http://stackoverflow.com/'
                      'questions/17381982/'
                      'python-regex-catastrophic-backtracking-where')
r =sock.read()
sock.close()

i = r.find('unpredictable, such as the following')
j = r.find('in order to match the following phrase')
k = r.find('I came up with this regex ')

print 'i == %d   j== %d' % (i,j)
print repr(r[i:j])


print
print 'j == %d   k== %d' % (j,k)
print repr(r[j:k])

the result is:

i == 10408   j== 10714
'unpredictable, such as the following:</p>\n\n<blockquote>\n  Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n  \n  <p>accomplishment (c)#######</p>\n</blockquote>\n\n<p>so '

j == 10714   k== 10955
'in order to match the following phrase:</p>\n\n<blockquote>\n  <p>Relationship to the strategic framework for the period 2014-2015:\n  programme 7, Economic and Social Affairs, subprogramme 3, expected\n  accomplishment (c)</p>\n</blockquote>\n\n<p>'

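Note the additional \n in front of programme 7, the additional \n <p> in front of accomplishment, the difference between Programme 7 and programme 7, and the presence of two blanks between framework and for the period in the string framework ################## for the period. This may explain the difficulties you had with your example.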

score 1 · Accepted Answer

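The backtracking catastrophe is probably caused by the fact that your regex contains multiple instances of the pattern #*\\s*#*: each of these will match any block of repeated #, but it can match the same text in multiple ways. When you have several of these patterns in your regex, the number of possibilities multiplies.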
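A small brute-force count may make the ambiguity visible. The sketch below is only an illustration (the helper count_splits is hypothetical): it enumerates every way of cutting a run of characters into three pieces and keeps the cuts that the three parts of #*\s*#* would accept.

```python
import re

def count_splits(run):
    # Count the ways the three parts of  #* \s* #*  can share `run` between them.
    n = len(run)
    ways = 0
    for i in range(n + 1):
        for j in range(i, n + 1):
            first, middle, last = run[:i], run[i:j], run[j:]
            if (re.fullmatch(r'#*', first)
                    and re.fullmatch(r'\s*', middle)
                    and re.fullmatch(r'#*', last)):
                ways += 1
    return ways


print(count_splits('####'))    # 5: the two #* parts can share four hashes in five ways
print(count_splits('## ##'))   # 1: the blank pins down where the split must be
```

A run of n hashes with no blank inside can therefore be matched in n + 1 ways at each such site, and a failing match may force the engine to try the combinations across several sites.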

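Are you searching within a larger body of text? If so, does the text contain phrases that coincide with the start of your search text? If so, the regex engine matches the start of the pattern and, on finding a mismatch, starts backtracking.

Note that the text framework ################## for is not matched by the regex f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r, because of unmatched space characters.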

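Possible solutions using regex:

1 Use possessive quantifiers instead of the standard greedy quantifiers. Unfortunately, according to this page, Python doesn't support possessive quantifiers.

2 Replace the pattern #*\\s*#* with (#|\\s)*, which will reduce the number of ways the regex can match a text. Note that this changed regex can match more than your original one (specifically, the suggested pattern will match the text ## ## ##, which the original pattern doesn't match).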
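Suggestion 2 can be sketched as below, under a few assumptions: the pattern is generated from the phrase rather than typed by hand, a character class [#\s]* stands in for (#|\\s)*, and the helper name build_pattern is made up for this example. Inside a word only #* is allowed between letters, and between words any mix of # and whitespace is allowed, so no two adjacent quantifiers compete for the same characters.

```python
import re

def build_pattern(phrase):
    # Letters of one word may be separated by '#' runs only.
    words = ['#*'.join(re.escape(ch) for ch in word) for word in phrase.split()]
    # Words may be separated by any mix of '#' and whitespace.
    return re.compile(r'[#\s]*'.join(words))


text = 'xx framework ' + '#' * 18 + ' for the peri##od yy'

# The original style of pattern cannot cross "blank, hashes, blank".
print(re.search(r'k#*\s*#*f', text))          # None

m = build_pattern('framework for the period').search(text)
print(m.span())                               # (3, 48)
print(text[m.start():m.end()])
```

The span is reported in the text that still contains the # clusters, so it can be used directly on the original document when every tag was replaced by a cluster of the same length.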

score 1 · Accepted Answer

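The following code shows that FMc's code doesn't work.

The line
from name_of_file import olding_the_new,finding refers to the code in my own answer to this question in this thread.
* Give the name name_of_file to the file containing the script of my code (found in my other answer in this thread) and it will run.
* Or, if you don't want to copy-paste my code, just comment out this import line and the following code will run, because I put in a try-except instruction that will react correctly to the absence of olding_the_new and finding

I use two ways to verify the results of FMc's code:
-1/ comparing the span returned by his code with the index of 'f' and the index of 'r', since we search for the phrase 'foobar' and I arranged that there are no f and r other than the ones in foobar
-2/ comparing with the first span returned by my code, hence the need for the above import from name_of_file

Nota bene

If disp = None is changed to disp = True, the execution displays intermediate results that help to understand the algorithm.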

.

import re
from name_of_file import olding_the_new,finding

def main():
    # Two versions of the text: the original,
    # and one without any of the "#" markers.
    for text_orig  in ('BLAH ##BLAH fo####o##ba###r## BL##AH',
                       'jkh##jh#f',
                       '#f#oo##ba###r##',
                       'a##xf#oo##ba###r##',
                       'ax##f#oo##ba###r##',
                       'ab###xyf#oo##ba###r##',
                       'abx###yf#oo##ba###r##',
                       'abxy###f#oo##ba###r##',
                       'iji#hkh#f#oo##ba###r##',
                       'mn##pps#f#oo##ba###r##',
                       'mn##pab###xyf#oo##ba###r##',
                       'lmn#pab###xyf#oo##ba###r##',
                       'fo##o##ba###r## aaaaaBLfoob##arAH',
                       'fo#o##ba####r## aaaaaBLfoob##ar#AH',
                       'f##oo##ba###r## aaaaaBLfoob##ar',
                       'f#oo##ba####r## aaaaBL#foob##arAH',
                       'f#oo##ba####r## aaaaBL#foob##ar#AH',
                       'foo##ba#####r## aaaaBL#foob##ar',
                       '#f#oo##ba###r## aaaBL##foob##arAH',
                       '#foo##ba####r## aaaBL##foob##ar#AH',
                       '#af#oo##ba##r## aaaBL##foob##ar',
                       '##afoo##ba###r## aaaaaBLfoob##arAH',
                       'BLAHHfo##o##ba###r aaBLfoob##ar#AH',
                       'BLAH#fo##o##ba###r aaBLfoob##ar',
                       'BLA#Hfo##o##ba###r###BLfoob##ar',
                       'BLA#Hfo##o##ba###r#BL##foob##ar',
                       ):

        text_clean = text_orig.replace('#', '')
        # Collect data on the positions and widths
        # of the markers in the original text.
        rgx     = re.compile(r'#+')
        markers = [(m.start(), len(m.group()))
                   for m in rgx.finditer(text_orig)]

        # Find the location of the search phrase in the cleaned text.
        # At that point you'll have all the data you need to compute
        # the span of the phrase in the original text.
        search = 'foobar'
        try:
            i = text_clean.index(search)
            print ('text_clean == %s\n'
                   "text_clean.index('%s')==%d   len('%s') == %d\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,i,search,len(search),
                      text_orig,
                      markers))
            S,E = compute_span(i, len(search), markers)
            print "span = (%d,%d)  %s %s     %s"\
                  % (S,E,
                     text_orig.index('f')==S,
                     text_orig.index('r')+1==E,
                     list(finding(search,text_orig,'#+')))
        except ValueError:
            print ('text_clean == %s\n'
                   "text_clean.index('%s')   ***Not found***\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,
                      text_orig,
                      markers))
        print '--------------------------------'

.

def compute_span(start, width, markers):
    # start and width are in expurgated text
    # markers are in original text
    disp = None # if disp==True => displaying of intermediary results
    span_start = start
    if disp:
        print ('\nAt beginning in compute_span():\n'
               '  span_start==start==%d   width==%d'
               % (start,width))
    for s, w in markers: # s and w are in original text
        if disp:
            print ('\ns,w==%d,%d'
                   '   s+w-1(%d)<start(%d) %s'
                   '   s(%d)==start(%d) %s'
                   % (s,w,s+w-1,start,s+w-1<start,s,start,s==start))
        if s + w - 1 < start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # the following if-else section is justified to be used
            # only after correction of the above line to this one:
            # if s+w-1 <= start or s==start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwm
            if s + w - 1 <= start and disp:
                print '  1a) s + w - 1 (%d) <= start (%d)   marker at left'\
                      % (s+w-1, start)
            elif disp:
                print '  1b) s(%d) == start(%d)' % (s,start)
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # Situation: marker fully to left of our text.
            # Adjust our start points rightward.
            start      += w
            span_start += w
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
        elif start + width - 1 < s:
            if disp:
                print ('  2) start + width - 1 (%d) < s (%d)   marker at right\n'
                       '  break' % (start+width-1, s))
            # Situation: marker fully to the right of our text.
            break
        else:
            # Situation: marker interrupts our text.
            # Advance the start point for the remaining text
            # rightward, and reduce the remaining width.
            if disp:
                print "  3) In 'else': s - start == %d   marker interrupts" % (s - start)
            start += w
            width = width - (s - start)
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
    return (span_start, start + width)

.

main()

Result

>>> 
text_clean == BLAH BLAH foobar BLAH
text_clean.index('foobar')==10   len('foobar') == 6
text_orig  == BLAH ##BLAH fo####o##ba###r## BL##AH
markers  == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]
span = (12,26)  True False     [(12, 27)]
--------------------------------
text_clean == jkhjhf
text_clean.index('foobar')   ***Not found***
text_orig  == jkh##jh#f
markers  == [(3, 2), (7, 1)]
--------------------------------
text_clean == foobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r##
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]
span = (0,11)  False False     [(1, 13)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == a##xf#oo##ba###r##
markers  == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,16)  False True     [(4, 16)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == ax##f#oo##ba###r##
markers  == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,15)  False False     [(4, 16)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == ab###xyf#oo##ba###r##
markers  == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abx###yf#oo##ba###r##
markers  == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,18)  False False     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abxy###f#oo##ba###r##
markers  == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == ijihkhfoobar
text_clean.index('foobar')==6   len('foobar') == 6
text_orig  == iji#hkh#f#oo##ba###r##
markers  == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnppsfoobar
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == mn##pps#f#oo##ba###r##
markers  == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnpabxyfoobar
text_clean.index('foobar')==7   len('foobar') == 6
text_orig  == mn##pab###xyf#oo##ba###r##
markers  == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == lmnpabxyfoobar
text_clean.index('foobar')==8   len('foobar') == 6
text_orig  == lmn#pab###xyf#oo##ba###r##
markers  == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo##o##ba###r## aaaaaBLfoob##arAH
markers  == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,9)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo#o##ba####r## aaaaaBLfoob##ar#AH
markers  == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f##oo##ba###r## aaaaaBLfoob##ar
markers  == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,11)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##arAH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##ar#AH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == foo##ba#####r## aaaaBL#foob##ar
markers  == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r## aaaBL##foob##arAH
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]
span = (0,11)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #foo##ba####r## aaaBL##foob##ar#AH
markers  == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]
span = (0,12)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaBLfoobar
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == #af#oo##ba##r## aaaBL##foob##ar
markers  == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]
span = (2,10)  True False     [(2, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaaaBLfoobarAH
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == ##afoo##ba###r## aaaaaBLfoob##arAH
markers  == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]
span = (1,14)  False True     [(3, 14), (24, 32)]
--------------------------------
text_clean == BLAHHfoobar aaBLfoobarAH
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == BLAHHfo##o##ba###r aaBLfoob##ar#AH
markers  == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobar aaBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLAH#fo##o##ba###r aaBLfoob##ar
markers  == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]
span = (4,16)  False False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r###BLfoob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r#BL##foob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
>>>

.

---------------------------------------------

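FMc's code is very subtle; it took me a long time to understand its principle and then to be able to correct it.
I leave the task of understanding the algorithm to the reader. I only state the corrections required to make FMc's code work correctly:

.

First correction: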

if s + w - 1 < start:
# must be changed to  
if s + w - 1 <= start or (s==start):

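EDIT

In my initial answer,
I had written ... or (s<=start).
That was an error of mine; in fact I intended to write
.. or (s==start)

NOTA BENE about this EDIT:

This error has no consequence in the code corrected with the two corrections I describe here, which correct FMc's initial code (the very first one, because at present he has changed it twice).
Indeed, if you correct the code with the two corrections, you obtain correct results for all the 25 examples taken for text_orig, with ... or (s <= start) as well as with ... or (s==start).
So I thought that the case in which s < start is True could never happen when the first condition s+w-1 <= start is False, presumably because w is always greater than 0 and for some other reason due to the configuration of the marker and non-marker sequences.
So I tried to find a demonstration of this impression, and I failed.
Moreover, I reached a state in which I no longer even understand FMc's algorithm (the very first one, before any of his edits)!
Despite this, I leave this answer as it is, and at the end of it I post the explanations of why these corrections are needed.
But I warn: FMc's first algorithm is very outlandish and hard to comprehend, because it compares indexes belonging to two different strings, one being text_orig with the #### markers and the other being the text cleaned of all these markers. Now I am no longer convinced that this makes sense.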

.

Second correction:

start += w
width = width - (s - start)
# must be changed to   
width -= (s-start) # this line MUST BE before the following one
start = s + w # because start += (s-start) + w

------------------

I am stunned that 2 people upvoted FMc's answer although it gives wrong code. It means that they upvoted the answer without having tested the given code.

--------------------------------------

.

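EDIT

Why must the condition if s + w - 1 < start: be changed to this one:
if s + w - 1 <= start or (s==start): ?

Because it may happen that s + w - 1 < start is False while s equals start.
In this case, the execution goes to the else section and executes (in the corrected code):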

width -= (s - start)
start = s + w

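Consequently, width doesn't change, although it evidently should when we look at the sequence concerned.

This case may occur when the first marker is examined, as with the following sequences: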

'#f#oo##ba###r##' : s,w==0,1 , 0==s==start==0  
'ax##f#oo##ba###r##' : s,w==2,2 , 2==s==start==2    
'abxy###f#oo##ba###r##' : s,w==4,3 , 4==s==start==4  
'#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0  
'BLAH#fo##o##ba###r aaBLfoob##ar' : s,w==4,1 4==s==start==4

For the following sequences, it occurs when the second marker is examined:

'iji#hkh#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  
'mn##pps#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7

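It can be better understood by executing my code with disp = True.

When s + w - 1 <= start holds, the fact that s may equal start is not troublesome, because the execution doesn't go to the else section; it goes to the first section, which only adds w to start and to span_start. But when s + w - 1 <= start is False while s equals start, the execution goes to the else section, where executing the instruction width -= (s-start) changes nothing in the value of width, and that is troublesome. So the condition or (s==start) must be added to prevent this else destination, and it has to be placed after an or so that this destination is prevented even when s+w-1 <= start is False, which may happen, as some of the examples show.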

.

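As for the fact that the instruction s+w-1 < start must be changed to s+w-1 <= start (with =),
it is because of the case w==1, which corresponds to a marker made of only 1 character #,
as in the cases
mn##pps#f#oo##ba###r## (second marker)
and BLAH#fo##o##ba###r (first marker).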

score 0 · Accepted Answer

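Do a depth-first search with an XML parser?

Maybe remember where in the XML document each text node was found, for a later reverse lookup. Your actual goal is still unclear.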
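One way this suggestion might look, as a rough sketch only: walk the document in depth-first order with xml.etree.ElementTree, and remember at which offset of the joined text each text node starts. The namespace URI below is a placeholder, the tiny document is made up from the tag names in the question, and the result maps back to elements rather than to character spans in the raw file.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://example.com/wordml'  # placeholder namespace URI, not the real one

doc = ('<w:p xmlns:w="%s">'
       '<w:r><w:t> To i</w:t></w:r>'
       '<w:r><w:rPr><w:sz w:val="17"/></w:rPr><w:t>ncrease knowledge</w:t></w:r>'
       '</w:p>' % W_NS)

root = ET.fromstring(doc)

runs = []      # (offset of this run in the joined text, the w:t element)
offset = 0
for node in root.iter('{%s}t' % W_NS):   # iter() walks in document order
    runs.append((offset, node))
    offset += len(node.text or '')

joined = ''.join(node.text or '' for _, node in runs)
print(repr(joined))                      # ' To increase knowledge'
print([start for start, _ in runs])      # [0, 5]
```

A phrase found in joined at some position can then be traced back to the elements whose offsets bracket it.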

score 0 · Accepted Answer


Without using regex you can get what you are trying to do with:

text.replace('#','').replace('  ',' ')
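A slightly more forgiving variant, still without regex, is sketched below. It assumes that any run of whitespace should collapse to one blank, which goes a little further than replacing double spaces once; the function name normalize is made up for the example.

```python
def normalize(text, marker='#'):
    # Drop the markers, then collapse every run of whitespace to one blank.
    return ' '.join(text.replace(marker, '').split())


sample = 'framework ' + '#' * 18 + ' for  the peri##od'
print(normalize(sample))   # framework for the period
```

Note that the cleaned string no longer carries positions from the original text, so this only answers whether a phrase occurs, not where.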

Answered on 2013-06-29T16:01:47.887
