string - 一个字符在 1 处出现的所有最长子序列

Question

在包含 n 个字符的序列 S 中；每个字符可能在序列中出现多次。您想找到 S 的最长子序列，其中相同字符的所有出现都在一个地方；

例如。如果 S = aaaccaaaccbccbbbab，那么最长的子序列（答案）是 aaaaaaccccbbbb 即 = aaa__aaacc_ccbbb_b。

换句话说，任何出现在 S 中的字母字符都只能出现在子序列中的一个连续块中。如果可能，给出一个多项式时间算法来确定解。

score 3 · Accepted Answer

设计

下面我给出一个解决这个问题的动态规划算法的 C++ 实现。运行时间的上限（可能不严格）由 O(g*(n^2 + log(g))) 给出，其中 n 是字符串的长度，g是输入。我不知道表征这个数字的好方法，但对于由 n 个不同字符组成的字符串，它可能与 O(2^n) 一样糟糕，这使得该算法在最坏的情况下呈指数级增长。它还使用 O(ng) 空间来保存 DP 记忆表。（与子字符串不同，子序列可能由原始字符串中的不连续字符组成。）在实践中，只要不同字符的数量很少，算法就会很快。

提出该算法的两个关键思想是：

长度为 n 的字符串的每个子序列要么是 (a) 空字符串，要么是 (b) 其第一个元素位于某个位置 1 <= i <= n 的子序列，其后跟从位置 i 开始的后缀上的另一个子序列+1。
如果我们一次将一个字符（或更具体的字符位置）附加到子序列中，那么为了构建所有且仅满足有效性标准的子序列，每当我们添加字符 c 时，如果添加了前一个字符 p，与 c 不同，则以后无法再添加任何 p 字符。

至少有两种方法可以管理上述第二点。一种方法是维护一组不允许的字符（例如，使用 256 位数组），当我们将字符添加到当前子序列时，我们会添加这些字符。每次我们想在当前子序列中添加一个字符时，我们首先检查它是否允许。

另一种方法是意识到，每当我们必须禁止一个字符出现在子序列的后面时，我们可以通过简单地从剩余的后缀中删除该字符的所有副本来实现这一点，并使用这个（可能更短的）字符串作为子问题来解决递归地。这种策略的优点是更有可能使用相同的字符串参数多次调用求解器函数，这意味着当递归转换为 DP 时可以避免更多的计算。这就是下面代码的工作方式。

递归函数应该采用 2 个参数：要处理的字符串，以及最近附加到函数输出将附加到的子序列的字符。必须允许第二个参数采用特殊值以指示尚未附加任何字符（这发生在顶级递归情况下）。实现此目的的一种方法是选择一个未出现在输入字符串中的字符，但这引入了不使用该字符的要求。显而易见的解决方法是传递第三个参数，一个布尔值，指示是否已经添加了任何字符。但是只使用 2 个参数会稍微方便一些：一个布尔值，指示是否已添加任何字符，以及一个字符串。如果布尔值是假的，那么字符串就是要处理的字符串。如果为真，则字符串的第一个字符被认为是最后一个添加的字符，其余的是要处理的字符串。采用这种方法意味着该函数只需要 2 个参数，从而简化了记忆。

正如我在顶部所说，这个算法在最坏的情况下是指数时间的。我想不出一种完全避免这种情况的方法，但一些优化可以帮助某些情况。我已经实现的一个方法是始终在一个步骤中添加相同字符的最大连续块，因为如果您从这样的块中添加至少一个字符，那么添加少于整个块的字符永远不会是最佳的。其他分支定界式优化是可能的，例如跟踪迄今为止的全局最佳字符串，并在我们可以确定当前子问题不能产生更长的子问题时缩短递归 - 例如，当字符数添加到到目前为止的子序列中，加上剩余的字符总数，小于迄今为止的最佳子序列的长度。

代码

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <map>

using namespace std;

class RunFinder {
    string s;
    map<string, string> memo[2];    // DP matrix

    // If skip == false, compute the longest valid subsequence of t.
    // Otherwise, compute the longest valid subsequence of the string
    // consisting of t without its first character, taking that first character
    // to be the last character of a preceding subsequence that we will be
    // adding to.
    string calc(string const& t, bool skip) {
        map<string, string>::iterator m(memo[skip].find(t));

        // Only calculate if we haven't already solved this case.
        if (m == memo[skip].end()) {
            // Try the empty subsequence.  This is always valid.
            string best;

            // Try starting a subsequence whose leftmost position is one of
            // the remaining characters.  Instead of trying each character
            // position separately, consider only contiguous blocks of identical
            // characters, since if we choose one character from this block there
            // is never any harm in choosing all of them.
            for (string::const_iterator i = t.begin() + skip; i != t.end();) {
            if (t.end() - i < best.size()) {
                // We can't possibly find a longer string now.
                break;
            }

                string::const_iterator next = find_if(i + 1, t.end(), bind1st(not_equal_to<char>(), *i));
                // Just use next - 1 to cheaply give us an extra char at the start; this is safe
                string u(next - 1, t.end());
                u[0] = *i;      // Record the previous char for the recursive call
                if (skip && *i != t[0]) {
                    // We have added a new segment that is different from the
                    // previous segment.  This means we can no longer use the
                    // character from the previous segment.
                    u.erase(remove(u.begin() + 1, u.end(), t[0]), u.end());
                }
                string v(i, next);
                v += calc(u, true);

                if (v.size() > best.size()) {
                    best = v;
                }

                i = next;
            }

            m = memo[skip].insert(make_pair(t, best)).first;
        }

        return (*m).second;
    }

public:
    RunFinder(string s) : s(s) {}

    string calc() {
        return calc(s, false);
    }
};

int main(int argc, char **argv) {
    RunFinder rf(argv[1]);
    cout << rf.calc() << '\n';
    return 0;
}

示例结果

C:\runfinder>stopwatch runfinder aaaccaaaccbccbbbab
aaaaaaccccbbbb
stopwatch: Terminated. Elapsed time: 0ms
stopwatch: Process completed with exit code 0.

C:\runfinder>stopwatch runfinder abbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,mnnsdbbbf
stopwatch: Terminated. Elapsed time: 609ms
stopwatch: Process completed with exit code 0.

C:\runfinder>stopwatch -v runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop
stopwatch: Command to be run: <runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop>.
stopwatch: Global memory situation before commencing: Used 2055507968 (49%) of 4128813056 virtual bytes, 1722564608 (80%) of 2145353728 physical bytes.
stopwatch: Process start time: 21/11/2012 02:53:14
abcdefghijklmnopqrstuvwxyz123456
stopwatch: Terminated. Elapsed time: 8062ms, CPU time: 7437ms, User time: 7328ms, Kernel time: 109ms, CPU usage: 92.25%, Page faults: 35473 (+35473), Peak working set size: 145440768, Peak VM usage: 145010688, Quota peak paged pool usage: 11596, Quota peak non paged pool usage: 1256
stopwatch: Process completed with exit code 0.
stopwatch: Process completion time: 21/11/2012 02:53:22

最后一次运行耗时 8 秒并使用了 145Mb，显示了包含许多不同字符的字符串如何出现问题。

编辑：在另一个优化中添加：如果我们能证明它不可能比迄今为止发现的最好的一个，我们现在退出寻找开始子序列的位置的循环。这将最后一个示例所需的时间从 32 秒降至 8 秒！

score 0 · Accepted Answer

编辑：这个解决方案对于 OP 的问题是错误的。我不会删除它，因为它可能适合其他人。:)

考虑一个相关问题：找到给定字符连续出现的 S 的最长子序列。这可以在线性时间内解决：

char c = . . .; // the given character
int start = -1;
int bestStart = -1;
int bestLength = 0;
int currentLength = 0;
for (int i = 0; i < S.length; ++i) {
    if (S.charAt(i) == c) {
        if (start == -1) {
            start = i;
        }
        ++currentLength;
    } else {
        if (currentLength > bestLength) {
            bestStart = start;
            bestLength = currentLength;
        }
        start = -1;
        currentLength = 0;
    }
}
if (bestStart >= 0) {
    // longest sequence of c starts at bestStart
} else {
    // character c does not occur in S
}

如果不同字符的数量（称为它m）相当小，只需将此算法并行应用于每个字符。这可以通过将start, bestStart, currentLength,转换bestLength为长数组来轻松完成m。最后，扫描bestLength数组以查找最大条目的索引，并使用数组中的相应条目bestStart作为答案。总复杂度为 O(mn)。

score -1 · Accepted Answer

import java.util.*;

public class LongestSubsequence {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        String str = sc.next();

        execute(str);

    }


    static void execute(String str) {

        int[] hash = new int[256];
        String ans = "";

        for (int i = 0; i < str.length(); i++) {

            char temp = str.charAt(i);

            hash[temp]++;
        }

        for (int i = 0; i < hash.length; i++) {
            if (hash[i] != 0) {
                for (int j = 0; j < hash[i]; j++)
                    ans += (char) i;
            }
        }

        System.out.println(ans);
    }
}

空间：256 -> O(256)，如果这样说是正确的，我不知道...，因为 O(256) 我认为是 O(1) 时间：O(n)

string - 一个字符在 1 处出现的所有最长子序列

3 回答 3

设计

代码

示例结果

Related

Reference