2

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.

In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.

I understood SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is length of the string) to O(P+log(N)) (Thanks to Chris Eelmaa's answer here and jogojapans answer here).

I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.

The algorithm (taken from the source):

Pattern matching algorithm

Explanation of the used variable names:

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

Now I want to try the algorithm using the following example (partly taken from here):

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana 

A = "banana" ; N = 6
W = "ban" ; P = 3

I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.

Here is how I would step through the algorithm:

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).

4

1 回答 1

1

我相信有一个错误。

第一个条件相当容易理解。当 LCP 长度 == 模式长度时,就完成了。当您的图案甚至小于或等于最小的图案时,唯一的选择就是最小的图案。

第二个条件是错误的。我们可以用反证法来证明。r < P || Wr <= a... 表示 r >= P && Wr > a... 如果 r >= P,那么我们怎么能有 Lw = N(未找到),因为我们已经有了 r 长度的公共前缀?

于 2018-04-15T13:04:32.497 回答