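You're already using a Counter, so all that's missing is a way to generate the substrings of any given string. If that bit lives in a function that takes a string and a minimum substring length, your counting logic can be a one-liner, with some help from itertools.chain: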
from collections import Counter
from itertools import chain

cnt = Counter(chain.from_iterable(substrings(line, 4) for line in lines))
cnt.most_common(2000)
That leaves the problem of generating those substrings. The easiest way is to loop over the possible substring lengths and, for each length, loop over the string, yielding the slice of that length that starts at each successive position. Since Python slices take a start and an end index rather than a length, a little slice arithmetic is needed:
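def substrings(s, min_length=1):
    # Try every length from min_length up to the whole string.
    for length in range(min_length, len(s) + 1):
        # Bound start by the current length so no slice runs past the end.
        for start in range(len(s) - length + 1):
            yield s[start:start + length]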
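As a quick sanity check, here is a minimal, self-contained sketch that runs the generator on a tiny input and then counts substrings across two lines. The sample strings and the names `sample_lines` and `cnt` are made up for illustration, and the function is repeated so the snippet runs on its own, with `start` bounded by the current `length` so every slice has the full length.

```python
from collections import Counter
from itertools import chain


def substrings(s, min_length=1):
    # Yield every substring of s that is at least min_length characters long.
    for length in range(min_length, len(s) + 1):
        # The last valid start leaves exactly `length` characters to slice.
        for start in range(len(s) - length + 1):
            yield s[start:start + length]


# Hypothetical sample data, not taken from the question.
sample_lines = ["abcd", "bcde"]

print(list(substrings("abcd", 3)))  # ['abc', 'bcd', 'abcd']

cnt = Counter(chain.from_iterable(substrings(line, 3) for line in sample_lines))
print(cnt.most_common(1))  # [('bcd', 2)]
```

Because `substrings` is a generator and `chain.from_iterable` is lazy, the substrings are fed to the Counter one at a time rather than being collected into an intermediate list first.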