python - 使用后缀数组和 lcp 在文本中查找子字符串的快速方法

Question

我正在尝试在大文本中查找包含子字符串（作为输入）的单词。文本如下所示：*america*python*erica*escape*.. 示例：输入：“rica” => 输出：america,erica

我使用后缀数组。

我的伪代码（pythonlike）是：

firstChar=input[0] // the first character of input
suffixArray=getSuffixArray(text) // suffix array
result=[]

for every index of suffix array which points to firstChar:
    length=len(input)
    indexText=text[suffixArray[index]]
    indexes=[]

    if input in text[indexText: indexText+length]:
        word=find whole word containig this index between '*' 
        result.append(word)

这行得通，但是太慢了。LCP 阵列应该改善算法的运行时间，但我不知道如何。你能给我一个建议吗？

提前致谢！

score 0 · Accepted Answer

后缀数组的免费 Python 代码是查找最长重复字符串的有效方式。它可以在个人计算机上处理多达 1 亿个字符。

python - 使用后缀数组和 lcp 在文本中查找子字符串的快速方法

1 回答 1

Related

Reference