python - 使用后缀数组查找出现的子字符串

Question

我试图弄清楚如何在后缀数组中进行二进制搜索以查找一次模式的出现。让我们有一个文本：。我试图找到.petertomasjohnerrnoerrorer

SA是此文本的后缀数组：8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5

现在，我想找到后缀数组的任何索引，它的值指向一个'er'. 所以输出将是 SA 中指向的索引，3,14 or 19因此它将返回 1,2 或 3

我正在尝试使用二进制搜索，但我不知道如何。

def findOneOccurence(text,SA,p):
    high = len(text)-1           # The last index
    low = 0                      # the lowest index
    while True:
        check = (high-low)/2     # find a middle

        if p in text[SA[check]:SA[check]+len(p)]:
            return check
        else:
            if text[SA[check]:SA[check]+len(p)]<p:
                low = check
            else:
                high = check
        if high<=low:
            return None

这返回11。但text[SA[11]:SA[11]+2]不是。'oh'_ 'er'问题可能出在哪里？

此功能适用于大约数百万个字符的大量文本。

编辑：我发现了一个错误。而不是 if text[SA[check]:SA[check+len(p)]]<p:should betext[SA[check]:SA[check]+len(p)]<p:但它仍然是错误的。它返回 None 而不是'er'

编辑二：另一个错误：如果 high>=low 更改为 high<=low，现在它返回 2，这很好。

编辑三：现在它可以工作了，但是在某些输入上它会进入循环并且永远不会结束。

score 2 · Accepted Answer

借用和编辑https://hg.python.org/cpython/file/2.7/Lib/bisect.py

>>> text= 'petertomasjohnerrnoerror'
>>> SA = 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
>>> def bisect_left(a, x, text, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if text[a[mid]:] < x: lo = mid+1
        else: hi = mid
    if not text[a[lo]:].startswith(x): 
        # i suppose text[a[lo]:a[lo]+len(x)] == x could be a faster check
        raise IndexError('not found')
    return a[lo]

>>> bisect_left(SA, 'er', text)
14

python - 使用后缀数组查找出现的子字符串

1 回答 1

Related

Reference