python - Given a list of words, identify all identical substrings of length 4 or greater

Question

Suppose I have a list of foreign words:

ilikuwa
alikuwa
nilifundisha
anafundisha
tunasoma
tulisoma

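I want to identify substrings of length 4 or greater that are shared by words in this list. For example, the substrings "kuwa", "fundisha", and "soma" all fall into this category.

Then, when I do a frequency analysis: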

from collections import Counter

cnt = Counter()
for word in words:  # words is the list shown above
    cnt[word] += 1
print(cnt.most_common(2000))

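I want these substrings to be counted as many times as they appear across the whole list, so that printing cnt.most_common(3) would give something like this:

kuwa - 2
fundisha - 2
soma - 2
ilikuwa - 1 ... etc.

However, I have no idea how to go about this. Any ideas?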

score 4 · Accepted Answer

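You're already using a Counter, so all that's missing is a way to generate the substrings of any given string. If that bit lives in a function that takes a string and a minimum substring length, then your counting logic can be a one-liner, with some help from itertools.chain: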

from collections import Counter
from itertools import chain
cnt = Counter(chain.from_iterable(substrings(line, 4) for line in lines))
cnt.most_common(2000)

Which leaves the problem of working out how to generate those substrings. The easiest way to do this is to loop over the possible sizes of substrings, and then loop over the string and give back the slice starting at each successive position in the string, and having the given length (but since slices in Python take a start and an end index, we need to do some slice arithmetic to make that work):

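def substrings(s, min_length=1):
    # Try every substring length from min_length up to the whole string.
    for length in range(min_length, len(s) + 1):
        # Stop where a slice of this length still fits, so no slice is truncated.
        for start in range(len(s) - length + 1):
            yield s[start:start + length]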
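
To see the two pieces working together, here is a minimal end-to-end sketch. The word list is my reading of the list in the question, so treat it as illustrative data, and note that the inner loop is bounded by the current length so that every slice has exactly that many characters:

```python
from collections import Counter
from itertools import chain


def substrings(s, min_length=1):
    """Yield every substring of s that has at least min_length characters."""
    for length in range(min_length, len(s) + 1):
        # Only start where a full slice of this length still fits.
        for start in range(len(s) - length + 1):
            yield s[start:start + length]


# Assumed reconstruction of the question's word list (illustrative only).
words = ["ilikuwa", "alikuwa", "nilifundisha",
         "anafundisha", "tunasoma", "tulisoma"]

# Flatten the per-word substring generators into one stream and count it.
cnt = Counter(chain.from_iterable(substrings(word, 4) for word in words))

for key in ("kuwa", "fundisha", "soma", "ilikuwa"):
    print(key, "-", cnt[key])
```

Be aware that shorter fragments such as "fund" also reach a count of 2, so most_common(3) on its own will not necessarily return exactly the three stems the question lists; some extra filtering (for example, dropping substrings contained in a longer substring with the same count) would be needed for that.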

score 1 · Accepted Answer

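If efficiency matters, I believe you'll need a Suffix Array.

As shown on the wiki, with a suffix array you can count the occurrences of any substring in O(m + log N), where m is the length of the substring and N is the total length of all the words.

However, you still need to enumerate all the substrings of each word. I don't think the O(N*N) enumeration can be avoided in the worst case. But using a dict() to avoid checking repeated substrings more than once will certainly improve average-case performance.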
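
A rough sketch of that idea follows. It is not an efficient implementation: build_suffix_array and count_occurrences are hypothetical helper names, the array is built by naively sorting the suffixes (much slower than a dedicated construction algorithm), and the binary search compares m-character prefixes, so each lookup costs about O(m log N) rather than the O(m + log N) quoted above. The words are joined with a newline, which is assumed never to occur inside a word, so that no match can span two words.

```python
def build_suffix_array(text):
    """Return the start indices of all suffixes of text, sorted by suffix.

    Naive construction: sorts the suffixes directly. Fine for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern in text via binary search on the array."""
    m = len(pattern)

    def first_index(past_equal):
        # First position whose m-char prefix is >= pattern, or, when
        # past_equal is True, the first whose prefix is > pattern.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (past_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


# Assumed reconstruction of the question's word list (illustrative only).
words = ["ilikuwa", "alikuwa", "nilifundisha",
         "anafundisha", "tunasoma", "tulisoma"]
text = "\n".join(words)  # separator assumed absent from every word
suffix_array = build_suffix_array(text)

counts = {}
for word in words:
    for length in range(4, len(word) + 1):
        for start in range(len(word) - length + 1):
            sub = word[start:start + length]
            if sub not in counts:  # look up each distinct substring only once
                counts[sub] = count_occurrences(text, suffix_array, sub)

print(counts["kuwa"], counts["fundisha"], counts["soma"], counts["ilikuwa"])
```

For a list this small the plain Counter approach is simpler and likely faster; the suffix array only starts to pay off when the text is large and construction uses a proper algorithm.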
