algorithm - 找到所有相同长度的最短唯一子串？

Question

给定一个仅包含四个字母的字符串序列，['a','g','c','t'] 例如：agggcttttaaaatttaatttgggccc.

找到字符串序列中所有长度相等的最短唯一子串（长度应该是所有唯一子串中的最小值）？

例如：aaggcgccttt 答案：['aa', 'ag', 'gg','cg', 'cc','ct'] 解释：长度为 2 的最短唯一子串

我尝试使用后缀数组和最长公共前缀，但我无法完美地绘制解决方案。

score 0 · Accepted Answer

我不确定您所说的“最小唯一子字符串”是什么意思，但是看看您的示例，我假设您的意思是“单个字母的最短运行”。如果是这种情况，您只需要遍历字符串一次（逐个字符）并计算您找到的所有最短运行。您应该跟踪到目前为止找到的最小运行的长度（开始时为无穷大）和当前运行的长度。

~~如果您需要找到确切的运行，您可以在遍历字符串时将找到的所有最小运行添加到例如列表中（如果找到更短的运行，则相应地修改该列表）。~~

编辑： 我对这个问题进行了更多思考，并提出了以下解决方案。

我们找到所有长度为 i 的唯一子串（按升序排列）。因此，首先我们考虑所有长度为 1 的子字符串，然后考虑所有长度为 2 的子字符串，依此类推。如果我们找到任何东西，我们就停下来，因为子字符串的长度只能从这一点开始增加。

您将不得不使用一个列表来跟踪您到目前为止看到的子字符串，并使用一个列表来存储实际的子字符串。当您找到新的子字符串时，您还必须相应地维护它们。

这是我想出的 Java 代码，以防您需要它：

        String str = "aaggcgccttt";
        String curr = "";
        ArrayList<String> uniqueStrings = new ArrayList<String>();
        ArrayList<String> alreadySeen = new ArrayList<String>();

        for (int i = 1; i < str.length(); i++) {
            for (int j = 0; j < str.length() - i + 1; j++) {
                curr = str.substring(j, j + i); 

                if (!alreadySeen.contains(curr)){ //Sub-string hasn't been seen yet
                    uniqueStrings.add(curr);
                    alreadySeen.add(curr);
                }
                else //Repeated sub-string found
                    uniqueStrings.remove(curr);
            }

            if (!uniqueStrings.isEmpty()) //We have found non-repeating sub-string(s)
                break;

            alreadySeen.clear();
        }

        //Output
        if (uniqueStrings.isEmpty())
            System.out.println(str);
        else {
            for (String s : uniqueStrings)
                System.out.println(s);
        }

该uniqueStrings列表包含所有最小长度的唯一子字符串（用于输出）。该alreadySeen列表跟踪所有已经看到的子字符串（用于排除重复的子字符串）。

score 0 · Accepted Answer

我将用 Python 编写一些代码，因为这是我发现的最简单的方法。我实际上写了重叠和非重叠的变体。作为奖励，它还检查输入是否有效。您似乎只对重叠变体感兴趣：

import itertools


def find_all(
        text,
        pattern,
        overlap=False):
    """
    Find all occurrencies of the pattern in the text.

    Args:
        text (str|bytes|bytearray): The input text.
        pattern (str|bytes|bytearray): The pattern to find.
        overlap (bool): Detect overlapping patterns.

    Yields:
        position (int): The position of the next finding.
    """
    len_text = len(text)
    offset = 1 if overlap else (len(pattern) or 1)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += offset
        else:
            break


def is_valid(text, tokens):
    """
    Check if the text only contains the specified tokens.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.

    Returns:
        result (bool): The result of the check.
    """
    return set(text).issubset(set(tokens))


def shortest_unique_substr(
        text,
        tokens='acgt',
        overlapping=True,
        check_valid_input=True):
    """
    Find the shortest unique substring.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.
        overlap (bool)
        check_valid_input (bool): Check if the input is valid.

    Returns:
        result (set): The set of the shortest unique substrings.
    """
    def add_if_single_match(
            text,
            pattern,
            result,
            overlapping):
        match_gen = find_all(text, pattern, overlapping)
        try:
            next(match_gen)  # first match
        except StopIteration:
            # the pattern is not found, nothing to do
            pass
        else:
            try:
                next(match_gen)
            except StopIteration:
                # the pattern was found only once so add to results
                result.add(pattern)
            else:
                # the pattern is found twice, nothing to do
                pass

    # just some sanity check
    if check_valid_input and not is_valid(text, tokens):
        raise ValueError('Input text contains invalid tokens.')

    result = set()
    # shortest sequence cannot be longer than this
    if overlapping:
        max_lim = len(text) // 2 + 1
        max_lim = len(tokens)
        for n in range(1, max_lim + 1):
            for pattern_gen in itertools.product(tokens, repeat=2):
                pattern = ''.join(pattern_gen)
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    else:
        max_lim = len(tokens)
        for n in range(1, max_lim + 1):
            for i in range(len(text) - n):
                pattern = text[i:i + n]
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    return result

在对输出的正确性进行一些健全性检查后：

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'

shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'

funcs = shortest_unique_substr_ovl, shortest_unique_substr_not

test_inputs = (
    'aaa',
    'aaaa',
    'aaggcgccttt',
    'agggcttttaaaatttaatttgggccc',
)

import functools

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:    
        print(func(test_input))
    print()

Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}

Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}

以我们实际的速度为基准是明智的。

您可以在下面找到一些基准，这些基准是使用此处的一些模板代码生成的（重叠变体为蓝色）：

以及其余代码的完整性：

def gen_input(n, tokens='acgt'):
    return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = tuple(2 ** (1 + i) for i in range(16))

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)

就渐近时间复杂度分析而言，仅考虑重叠情况，令N为输入大小，令K为令牌数（在您的情况下为 4），find_all()为 O(N)，主体shortest_unique_substr为O(K²)( + O((K - 1)²) + O((K - 2)²) + ...) . 因此，这是整体的O(N*K²)或O(N*(Σk²))（对于k = 1, …, K），因为K是固定的，所以这是O(N)，正如基准似乎表明的那样。

algorithm - 找到所有相同长度的最短唯一子串？

2 回答 2

Related

Reference