python - Python：子序列搜索

Question

我有一个序列“abccabac”和一个子序列“abc”。我需要获取序列中所有出现的子序列“abc”的索引。什么是内存有效的方法？

例子：

输入：

**sequence**: 'abccabac' **subsequence**: 'abc'

输出：

score 2 · Accepted Answer

我能想到的绝对最严格的方法是使用itertools.combinations

import itertools

sequence = "abccabac"

subsequence = "abc"

for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
    if "".join(pair[1] for pair in combo) == subsequence:
        print([pair[0] for pair in combo])

如果您的实际序列包含许多甚至不在子序列中的字符，那么在开始之前过滤掉不相关的字符combinations肯定会大大提高效率：

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    ...

除了加入每个组合之外，您还可以只使用all它只会检查字符，直到发现一个不相等：

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    if all(pair[1]==char for pair,char in zip(combo,subsequence)):
        print([pair[0] for pair in combo])

这是我刚刚放在一起的另一种选择，我想算法可以进一步优化，但这是我能想到的最好的。为子序列的sub_sequence_generator每个字母创建一个索引列表，然后sub_helper递归的每个级别遍历从最后一个字母的索引开始的子序列下一个字母的所有索引。

import itertools
import bisect

def sub_helper(cur_i, combo, all_options):
    #cur_i is the index of the letter in the subsequence
    #cur_idx is the index of the sequence which contains that letter
    if cur_i>0:
        prev_i = combo[cur_i-1]
    else:
        prev_i = -1
    cur_options = all_options[cur_i]
    start_i = bisect.bisect(cur_options,prev_i)
    cur_i_gen = itertools.islice(cur_options,start_i,None)
    if cur_i+1 == len(all_options): #last one:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield tuple(combo)
    else:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield from sub_helper(cur_i+1, combo, all_options)


def sub_sequence_generator(sequence,sub_seq):
    indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
    for i,c in enumerate(sequence):
        try:
            indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
        except KeyError:
            pass

    # now we have indices for each character of the sub_sequence
    # we just need to put them in a sequence that corelates to the subsequence
    chr_indices = tuple(indices_map[c] for c in sub_seq)

    return sub_helper(0,[None]*len(chr_indices), chr_indices)

sequence = "abccabac"
subsequence = "abc"

for idxs in sub_sequence_generator(sequence,subsequence):
    print(idxs)

就内存而言，这将为子序列的每个字符创建一个列表，其中包含该字符所在的主序列中的索引。这些列表的元组，以及不断更新的索引列表和每次组合时的元组yield，以及像这样的迭代器islice，这是非常节省内存的，但是由于我没有对其进行广泛测试，所以我不能给出任何保证它没有错误。

python - Python：子序列搜索

例子：

输入：

1 回答 1

Related

Reference