java - 在Java中找到给定有序子序列的超序列

Question

我在正在编写的一个不相关的程序中遇到了这个问题，我花了好几个小时试图解决它，因为我认为它会很有趣。是的，但我一直无法做到。我的代码只解决了一些子集的序列。这个问题也感觉像是一个一般的数学问题，几十年来可能已经以多种方式解决了，但我缺乏数学技能和术语来找到解决方案，或者在网上找到关于这个特定问题的任何东西。

我有一组子序列，我知道它们是更大的未知（超级？）序列的一部分。我不认为这些子序列是数学意义上的集合，因为它们是有序的，但它们的相似之处在于它们不包含重复的元素。master/super/whateversequence 也是如此。（为了清楚起见，我将其称为超序列。）

子序列都包含相同类型的数据，但是数据不是按字母顺序、升序或类似顺序排列的。从某种意义上说，数据是任意顺序的：在超序列中。这就是我感兴趣的。我想找到这些子序列的未知超序列。

为了简单起见，我尝试使用字母来解决这个问题，但我可以稍后重构代码以满足我的需要。显然，因为我仍在尝试解决这个问题，所以我首先为不包含重复元素的超序列想出了一个合适的词：FLOWCHARTS。

然后我想出了以下六个子序列：

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

这是我的顺序排序方法：

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

最后，我的代码找到了F,L,O,W,H,C,A,R,T,S这些子序列的超序列，它非常接近但并不完美。我还需要多次运行我的排序方法，所以我想出的“算法”也不完美。“规则映射”是一个哈希映射，其中键是字符对象的另一个哈希映射，它位于子序列中的键 Character 之后（因此在超序列中）。

是否有某种我可以使用的 Java 库来进行这种序列查找？有人可以在告诉我这叫什么和/或帮助我找到适合这项工作的算法方面指出我正确的方向吗？

此外，我的程序的缩短控制台输出：

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false

score 1 · Accepted Answer

您要求的是从部分订单计算总订单。我在这方面找不到太多工作。但是我们可以在这里讨论一下这个问题。

考虑A<B<C<D。如果我们有序列和A<C，我们将永远无法计算总顺序。我们只得到和。B<DC<DA<C<DB<D

我认为可以证明我们将需要表单的所有关系N-1，并在最终链中连续呈现，以重建总顺序（可能有额外的关系，但那些是额外的信息）。作为一个非严格的演示，假设我们有并且假设我们能够从偏序重构为。现在，为了将其放入总顺序中的正确位置，我们需要知道. 没有其他关系可以让它适合总订单。继续向下进入X<YXYA1<A2<A3<...<ANA_beginA_endA_(begin-1)<A_beginA_begin..A_end我们应该能够通过某种归纳/无限下降来证明我们将需要单词的连续字符给出的所有关系来重构它。

上述序列集中缺少的信息是F<L和C<H。可以得到的序列W->C->R->T->S是F->O->W->H->A->R->T->S和L->O->W->H->A->R->T->S。计算余数需要更多信息。

在上述情况下，经过分解和重复消除，我们有以下关系：

A,R
A,S <-- redundant since R<S and A<R
C,R
C,S <-- redundant since R<S and A<R
F,O 
F,W <-- redundant since O<W and F<O
H,A
L,H <-- redundant since O<H and L<O
L,O
O,H <-- redundant since O<W and W<H
O,W
R,S <-- redundant since T<S and R<T
R,T
T,S
W,C
W,H

有 16 个关系，其中 6 个是立即冗余的。去除冗余，我们得到以下 10 个关系：

A,R <-- consecutive letters in actual word
C,R
F,O 
H,A <-- consecutive letters in actual word
L,O <-- consecutive letters in actual word
O,W <-- consecutive letters in actual word
R,T <-- consecutive letters in actual word
T,S <-- consecutive letters in actual word
W,C <-- consecutive letters in actual word
W,H

原始序列中唯一缺少的是F<L和C<H。附加的给定关系C<R，F<O并W,H重复 LHS 或 RHS 并给出不可操作的信息（基本上这些链接两个部分有序的链但不在端点，所以你知道一个链将被合并或小于另一个链但不要知道在哪里）。

一旦添加了缺失的关系，就有多种方法可以实现这一点。您的代码可能会自行运行。

score 0 · Accepted Answer

我在问题中描述的问题是最短的常见超序列问题。我没有在网上搜索这个，因为我在编写代码时过于专注于数学集，只有在我开始写出我的问题后，我才意识到我正在处理序列。其实我以为我只是当场编造了“超序列”这个词。事实证明，这正是我需要在网上搜索以找到与此问题相关的几页材料所需的词。

奥利弗查尔斯沃思的评论是绝对正确的。给定的子序列缺少生成正确超序列的信息，因此在我的示例中，无法找到实际的超序列。Coffee对生物信息学的评论很可能是指最短的常见超弦问题，该问题在 DNA 序列的重建中具有应用，在这个问题中进行了讨论。最短的常见超序列和超串问题乍一看非常相似，但是，在超串问题中，子串必须由超串的相邻元素组成，这与超序列问题不同。（如下图所示。）

这两个问题也是NP-complete或“NP-hard”。我理解它的方式是，这些问题没有最佳解决方案或一些神奇的算法，您可以将其复制粘贴到您的代码中。只有近似值和“足够好”的解决方案。

令人惊讶的是，我无法找到与此问题近似的完整 Java 代码，但我确实遇到了一些带有其他语言代码的站点。我在下面列出了一些我在研究这个问题时发现的有用资源。我还包括了与最短的常见超弦问题相关的资源。

java - 在Java中找到给定有序子序列的超序列

2 回答 2

进一步研究的其他资源

java - 在Java中找到给定有序子序列的超序列

2 回答 2

进一步研究的其他资源

Related

Reference