构建一个带有最长公共前缀的后缀数组,并找到 K 个值 > 阈值的连续 LCP(其中 K 是另一个阈值)
在搜索中,要求找到的后缀的候选LCP在源字符串中是连续的/不重叠。搜索算法:
let minimum_iterations = 1
let minimum_pattern_length = 2
let src = source string
let suffix = sorted suffix array,
where each entry is suffix i0 (inclusive) to |src| (exclusive)
let lcp = longest common prefixes,
where lcp(i) = longest common prefix of suffix(i) and suffix(i-1) when i > 0,
or 0 if i = 0
for i = 0 to |suffix|:
if lcp[i] < minimum_pattern_length: continue
let j = i
let iterations = 0
let last_matched_i0 = suffix[i-1].i0
while j < |suffix| and lcp[j] >= lcp[i]:
if (suffix[j].i0 + lcp[i] + 1) = last_matched_i0:
iterations++
last_matched_i0 = suffix[j].i0
j++
print 'pattern = ' .. suffix[i][0..lcp[i]] .. ' iterations = ' .. iterations
print 'starting at ' .. suffix[j-1].i0 .. ' to ' .. suffix[i-1].i0 + lcp[i]
skip pattern if wanted
(suffix[j].i0 + lcp[i] + 1) = last_matched_i0
确保匹配的模式是连续的。如果它们都是模式,我们假设suffix[j]
包含,因为在排序的后缀数组中总是在前面suffix[j-1]
ab
abab
示例:3ababab65
后缀
- 3ababab65
- ababab65
- babab65
- abab65
- bab65
- ab65
- b65
- 65
- 5
排序后缀以及最长公共前缀 (LCP - Suffix):LCP(i) 是 SUFFIX(i) 和 SUFFIX(i-1) 之间的最长公共前缀
- 0 - 5
- 0 - 65
- 0 - 3ababab65
- 0 - ab65
- 2 - ab ab65
- 4 - abab ab65
- 0 - b65
- 1 - b ab65
- 3 -巴布ab65
LCP 候选人:
- ab 65, ab ab65, ab abab65 - 连续,无重叠
- abab 65, abab ab65 - 重叠
- b 65, b ab65 - 不连续
- bab 65, bab ab65 - 重叠
可能的模式:
您必须稍微修改代码以处理以下情况:
亚贝巴
您将有 aba 和 ababa 的后缀,LCP 为 3
Or you could just do this... Shortest Repeating Sub-String ... Donno how efficient it is... but way easier.