
I have been researching this for two days without finding anything, so I decided to write my own string repetition detector. Basically, the function

def findRepetitions (string):

would receive a string, search it for repetitions, and return a list of the repeated substrings, each reduced to its simplest form.

For example:

findRepetitions ("trololololo") --> ["olo"]
findRepetitions ("bookkeeper") ---> ["o", "k", "e"]
findRepetitions ("Hello, Molly") -> ["l", "l"]
findRepetitions ("abcdefgh") -----> []
findRepetitions ("102102102") ----> ["102"]

In the third example, the function returns ["l", "l"] instead of ["ll"], because I want to search for repetitions only in the neighboring characters.

I know this may be hard, but I have been thinking about it for a long time and cannot find a smart solution.


2 Answers


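This is a well-known problem:

http://en.wikipedia.org/wiki/Longest_repeated_substring_problem

You can solve it efficiently by building a trie:

http://en.wikipedia.org/wiki/Radix_tree

The wiki page shows pseudocode and examples for lookup and for adding nodes, which are the only operations you need. Insert the string into the trie starting at each character; for example, for the string abcd insert abcd, bcd, cd, and d. This particular kind of trie is called a "suffix tree":

http://en.wikipedia.org/wiki/Suffix_tree

Each time you traverse an already established path, you have found a repetition in your string. You can then list all repetitions in a separate data structure and extract the longest one (if necessary).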
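The insertion procedure described above can be sketched in a few lines. This is only a minimal illustration, assuming a plain uncompressed trie built from nested dictionaries rather than a compressed suffix tree; the function name repeated_substrings is hypothetical. Note that it reports every substring that occurs more than once anywhere in the string, not only repetitions between neighboring characters.

```python
def repeated_substrings(s):
    """Insert every suffix of s into a nested-dict trie and collect each
    substring whose path already existed, i.e. one that occurs more than once."""
    root = {}
    repeats = set()
    for start in range(len(s)):
        node = root
        for end in range(start, len(s)):
            ch = s[end]
            if ch in node:
                # An earlier suffix created this path, so s[start:end + 1] was seen before.
                repeats.add(s[start:end + 1])
            else:
                node[ch] = {}
            node = node[ch]
    return repeats


print(sorted(repeated_substrings("abab")))  # ['a', 'ab', 'b']
```

If only the longest repetition is needed, max(repeats, key=len) on a non-empty result picks it out. An uncompressed trie like this can grow to a number of nodes quadratic in the string length, so it suits short inputs.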

answered 2012-11-06T18:12:57.980

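Your examples are inconsistent. For instance, olo does not repeat in trololololo the way l does in Hello, Molly; there is an l between the instances. The consecutive repetitions in trololololo are lolo, lo, olol, and ol. Are you asking for a "greedy" algorithm? In that case, given trololololo, would it return olol?

Anyway, here is some code.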

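from collections import Counter

def find_repetition(p):
    """Return a Counter mapping each non-empty substring of p to its number of occurrences."""
    lookup = Counter()
    while p:
        # Count every non-empty prefix of the current suffix, including the suffix itself.
        for i in range(1, len(p) + 1):
            lookup[p[:i]] += 1
        p = p[1:]
    return lookup

def repeats(p):
    """Return the substrings of p that are immediately followed by a copy of themselves."""
    counts = find_repetition(p)
    # Keep the substrings that were seen more than once ...
    rs = [s for s in counts if counts[s] > 1]
    # ... and of those, only the ones that occur back to back.
    return [r for r in rs if r * 2 in p]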

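If you want it to be "greedy" in the way I described, you would have to add another function that takes the results from repeats and pares them down whenever it finds a match.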
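One possible reading of that extra step is sketched below. It is an assumption, not the answer's own method. It scans the string from left to right, takes the shortest candidate that is immediately followed by a copy of itself at the current position, and skips the whole run before continuing. The name greedy_repeats and its candidates parameter (the list returned by repeats) are hypothetical.

```python
def greedy_repeats(p, candidates):
    """Scan p left to right; at each position take the shortest candidate that
    is immediately followed by a copy of itself, then skip the whole run."""
    # Shortest units first; empty strings are dropped so the scan always advances.
    units = sorted({c for c in candidates if c}, key=len)
    found = []
    i = 0
    while i < len(p):
        for unit in units:
            if p.startswith(unit * 2, i):
                found.append(unit)
                # Consume every back-to-back copy of the unit.
                while p.startswith(unit, i):
                    i += len(unit)
                break
        else:
            i += 1  # no repetition starts at this position
    return found


print(greedy_repeats("Hello, Molly", ["l"]))                # ['l', 'l']
print(greedy_repeats("102102102", ["210", "021", "102"]))   # ['102']
```

Under this reading, trololololo yields ['ol'] rather than the ['olo'] given in the question, which reflects the inconsistency noted above.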

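Currently, the results look like this (the order of the items within each list depends on dictionary ordering and may differ):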

test = "trololololo", "bookkeeper", "Hello, Molly", "abcdefgh", "102102102"

>>> for i in test:
...     repeats(i)

['lolo', 'lo', 'olol', 'ol']
['e', 'o', 'k']
['l']
[]
['210', '021', '102']

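Caveat

find_repetition is not very fast, because it basically generates every substring of the string, of every length, and throws them all into a Counter object.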
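To put a rough number on that cost: for an input of length n, the loop performs one Counter update per (start, length) pair, so the number of updates grows quadratically, and each update also copies a slice of up to n characters. The helper below only counts the updates; its name count_updates is illustrative.

```python
def count_updates(n):
    """Counter updates made for an input of length n: one per (start, length) pair."""
    return sum(n - start for start in range(n))  # equals n * (n + 1) // 2


for n in (10, 100, 1000):
    print(n, count_updates(n))
# 10 55
# 100 5050
# 1000 500500
```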

answered 2012-11-06T19:02:13.600