algorithm - Finding patterns in a set

Question

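What algorithms could I use to determine common characters in a set of strings?

To keep the example simple, I only care about runs of 2 or more consecutive characters, and whether a run shows up in 2 or more of the samples. For instance: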

0000abcde0000
0000abcd00000
000abc0000000
00abc000de000

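I'd want to know:

00 is used in 1,2,3,4
000 is used in 1,2,3,4
0000 is used in 1,2,3
00000 is used in 2,3
ab is used in 1,2,3,4
abc is used in 1,2,3,4
abcd is used in 1,2
bc is used in 1,2,3,4
bcd is used in 1,2
cd is used in 1,2
de is used in 1,4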

score 3 · Accepted Answer

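I'm assuming this isn't homework. (If it is, you're on your own regarding plagiarism! ;-)

Below is a quick-and-dirty solution. The time complexity is O(m**2 * n), where m is the average string length and n is the size of the array of strings.

An instance of Occurrence keeps the set of indices of the input strings that contain a given substring. The commonOccurrences routine scans a string array, calling captureOccurrences for each non-null string. The captureOccurrences routine adds the current index to the Occurrence of every possible substring of the string it is given. Finally, commonOccurrences forms the result set by picking only those Occurrences that have at least two indices.

Note that your example data has many more common substrings than you identified in the question. For example, "00ab" occurs in every input string. An additional filter that selects interesting strings based on content (e.g. all digits, all alphabetic, etc.) is, as they say, left as an exercise for the reader. ;-)

Quick-and-dirty Java source: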

package com.stackoverflow.answers;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonSubstringFinder {

    public static final int MINIMUM_SUBSTRING_LENGTH = 2;

    public static class Occurrence implements Comparable<Occurrence> {
        private final String value;
        private final Set<Integer> indices;
        public Occurrence(String value) {
            this.value = value == null ? "" : value;
            indices = new TreeSet<Integer>();
        }
        public String getValue() {
            return value;
        }
        public Set<Integer> getIndices() {
            return Collections.unmodifiableSet(indices);
        }
        public void occur(int index) {
            indices.add(index);
        }
        public String toString() {
            StringBuilder result = new StringBuilder();
            result.append('"').append(value).append('"');
            String separator = ": ";
            for (Integer i : indices) {
                result.append(separator).append(i);
                separator = ",";
            }
            return result.toString();
        }
        public int compareTo(Occurrence that) {
            return this.value.compareTo(that.value);
        }
    }

    public static Set<Occurrence> commonOccurrences(String[] strings) {
        Map<String,Occurrence> work = new HashMap<String,Occurrence>();
        if (strings != null) {
            int index = 0;
            for (String string : strings) {
                if (string != null) {
                    captureOccurrences(index, work, string);
                }
                ++index;
            }
        }
        Set<Occurrence> result = new TreeSet<Occurrence>();
        for (Occurrence occurrence : work.values()) {
            if (occurrence.indices.size() > 1) {
                result.add(occurrence);
            }
        }
        return result;
    }

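    // Records index against every substring of string that has at least
    // MINIMUM_SUBSTRING_LENGTH characters.
    private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
        final int maxLength = string.length();
        for (int i = 0; i < maxLength; ++i) {
            // j is the exclusive end index passed to substring(), so it runs up to and
            // including maxLength; stopping at j < maxLength would skip every substring
            // that ends at the last character of the string.
            for (int j = i + MINIMUM_SUBSTRING_LENGTH; j <= maxLength; ++j) {
                String partial = string.substring(i, j);
                Occurrence current = work.get(partial);
                if (current == null) {
                    current = new Occurrence(partial);
                    work.put(partial, current);
                }
                current.occur(index);
            }
        }
    }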

    private static final String[] TEST_DATA = {
        "0000abcde0000",
        "0000abcd00000",
        "000abc0000000",
        "00abc000de000",
    };
    public static void main(String[] args) {
        Set<Occurrence> found = commonOccurrences(TEST_DATA);
        for (Occurrence occurrence : found) {
            System.out.println(occurrence);
        }
    }

}

Sample output, one Occurrence per line, with 0-based string indices. This run excluded substrings that end at a string's final character, so entries such as "00000": 1,2 do not appear:

"00": 0,1,2,3
"000": 0,1,2,3
"0000": 0,1,2
"0000a": 0,1
"0000ab": 0,1
"0000abc": 0,1
"0000abcd": 0,1
"000a": 0,1,2
"000ab": 0,1,2
"000abc": 0,1,2
"000abcd": 0,1
"00a": 0,1,2,3
"00ab": 0,1,2,3
"00abc": 0,1,2,3
"00abc0": 2,3
"00abc00": 2,3
"00abc000": 2,3
"00abcd": 0,1
"0a": 0,1,2,3
"0ab": 0,1,2,3
"0abc": 0,1,2,3
"0abc0": 2,3
"0abc00": 2,3
"0abc000": 2,3
"0abcd": 0,1
"ab": 0,1,2,3
"abc": 0,1,2,3
"abc0": 2,3
"abc00": 2,3
"abc000": 2,3
"abcd": 0,1
"bc": 0,1,2,3
"bc0": 2,3
"bc00": 2,3
"bc000": 2,3
"bcd": 0,1
"c0": 2,3
"c00": 2,3
"c000": 2,3
"cd": 0,1
"de": 0,3
"de0": 0,3
"de00": 0,3
"e0": 0,3
"e00": 0,3

score 2 · Accepted Answer

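This is most likely an NP-hard problem. It looks similar to multiple sequence alignment, which is NP-hard. Basically, you could adapt multi-dimensional Smith-Waterman (= local sequence alignment) to your needs. There may be a more efficient algorithm, though.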

score 2 · Accepted Answer

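Build a tree in which each path through the tree is a sequence of letters. Have each node contain a "set" that string references are added to (or just keep a count). Then keep track of N positions in the word, where N is the longest sequence you care about (e.g. start a new handle at each character, walk every handle one level down at each step, and abort each handle after N steps).

This works better with a small, finite and dense alphabet (DNA was the first place I thought of using it).

Edit: If you know ahead of time which patterns you care about, the above can be adapted by building the tree in advance and then only checking whether you are still on the tree, rather than extending it.

An example

Input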

abc
abd
abde
acc
bde

The tree

a : 4
  b : 3
    c : 1
    d : 2
      e : 1
  c : 1
    c : 1
b : 4
  d : 3
    e : 2
  c : 1
c : 3
  c : 1
d : 3
  e : 2
e : 2
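
The tree described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: the class name SubstringTrie, the Node type, and the maxDepth, minLength and minSources parameters are all made up for the example, and each node stores the set of input-string indices (the "set" option) rather than a raw occurrence count.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: a trie of all substrings up to maxDepth characters.
public class SubstringTrie {

    // One node per letter sequence; sources holds the indices of the
    // input strings in which that sequence occurs.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Set<Integer> sources = new TreeSet<>();
    }

    private final Node root = new Node();
    private final int maxDepth; // N: the longest sequence we care about

    public SubstringTrie(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // Starts a walk at every character and follows it for at most maxDepth steps.
    public void add(int index, String s) {
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            int end = Math.min(s.length(), start + maxDepth);
            for (int i = start; i < end; i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
                node.sources.add(index);
            }
        }
    }

    // Returns every sequence of at least minLength characters that occurs
    // in at least minSources different input strings.
    public Map<String, Set<Integer>> common(int minLength, int minSources) {
        Map<String, Set<Integer>> out = new TreeMap<>();
        collect(root, new StringBuilder(), minLength, minSources, out);
        return out;
    }

    private static void collect(Node node, StringBuilder path, int minLength,
                                int minSources, Map<String, Set<Integer>> out) {
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            Node child = entry.getValue();
            // A child's sources are a subset of its parent's, so a branch that is
            // already too rare cannot contain a qualifying descendant.
            if (child.sources.size() < minSources) {
                continue;
            }
            path.append(entry.getKey());
            if (path.length() >= minLength) {
                out.put(path.toString(), child.sources);
            }
            collect(child, path, minLength, minSources, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        String[] data = {"abc", "abd", "abde", "acc", "bde"};
        SubstringTrie trie = new SubstringTrie(4);
        for (int i = 0; i < data.length; i++) {
            trie.add(i, data[i]);
        }
        trie.common(2, 2).forEach((k, v) -> System.out.println(k + " : " + v));
    }
}
```

For the five example inputs, with 0-based string indices, the sequences of two or more letters shared by at least two strings are ab, abd, bd, bde and de. Capping the walk at maxDepth keeps the tree small when only short patterns matter.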

score 1 · Accepted Answer

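Do you know ahead of time which "values" you need to search for? Or do you need the code to parse the strings and give you statistics like the ones you posted?

If you know ahead of time what you're looking for, the Boyer-Moore algorithm is a very fast way to determine whether substrings exist (and even to locate them).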
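
As a rough illustration of searching for a pattern that is known in advance, here is a sketch of the Boyer-Moore-Horspool variant, which is a simplification of Boyer-Moore that keeps only a bad-character shift table. The class name HorspoolSearch and both method names are assumptions made for this example; it is not a full Boyer-Moore implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Boyer-Moore-Horspool (a simplified Boyer-Moore).
public class HorspoolSearch {

    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int indexOf(String text, String pattern) {
        int m = pattern.length();
        int n = text.length();
        if (m == 0) {
            return 0;
        }
        if (m > n) {
            return -1;
        }
        // For each character of the pattern except the last one, remember the
        // distance from its last occurrence to the end of the pattern.
        // Characters that are not in the table shift by the full length m.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }
        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left within the current window.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Slide the window based on the text character under the pattern's last position.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }

    // Returns the 0-based indices of the strings that contain pattern.
    public static List<Integer> stringsContaining(String[] strings, String pattern) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < strings.length; i++) {
            if (indexOf(strings[i], pattern) >= 0) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] data = {"0000abcde0000", "0000abcd00000", "000abc0000000", "00abc000de000"};
        System.out.println("abcd: " + stringsContaining(data, "abcd"));
        System.out.println("de: " + stringsContaining(data, "de"));
    }
}
```

With the question's four strings this reports "abcd" in strings 0 and 1 and "de" in strings 0 and 3 (0-based), which matches the 1-based lists in the question. For patterns this short, String.indexOf from the standard library would likely do just as well.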

score 1 · Accepted Answer

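Look up "suffix tree" on the web. Or pick up "Algorithms on Strings, Trees and Sequences" by Dan Gusfield. I don't have the book at hand to verify, but the Wikipedia page on suffix trees says that page 205 contains a solution to your problem: "finding the longest substrings common to at least k strings in a set".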

score 0 · Accepted Answer

You could analyze a distance matrix. Any diagonal movement (with no change in cost) is an exact match.
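
One possible reading of this idea, for a pair of strings, is the classic longest-common-substring table, where each cell holds the length of the run of matching characters along its diagonal. The sketch below uses that table rather than a true edit-distance matrix, and the class name DiagonalRuns and the method commonSubstrings are assumptions made for the example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: diagonal runs of matches between two strings.
public class DiagonalRuns {

    // Returns every substring of at least minLength characters that occurs in both a and b.
    public static Set<String> commonSubstrings(String a, String b, int minLength) {
        Set<String> found = new TreeSet<>();
        // prev[j] is the length of the matching diagonal run that ends at
        // a[i - 2] and b[j - 1]; cur[j] is the same for a[i - 1] and b[j - 1].
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Extend the run coming from the upper-left neighbour.
                    cur[j] = prev[j - 1] + 1;
                    // Every tail of the run that is long enough is a shared substring.
                    for (int len = minLength; len <= cur[j]; len++) {
                        found.add(a.substring(i - len, i));
                    }
                }
            }
            prev = cur;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(commonSubstrings("abcd", "xbcdy", 2));
    }
}
```

This only compares two strings at a time, so covering a whole set would mean running it over every pair and merging the results.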

score 0 · Accepted Answer

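You may find a suffix array simpler and more efficient than a suffix tree, depending on how frequent common substrings are in your data: if they are common enough, you'll need one of the more sophisticated suffix array construction algorithms. (The naive approach is to just use your library's sort function.)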
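
A minimal sketch of the naive approach, assuming it is acceptable to materialize every suffix as its own String and sort with the standard library: after sorting, the longest substring shared by two different input strings shows up as the longest common prefix of some adjacent pair of suffixes with different owners. The class name NaiveSuffixArray and its methods are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a naive "suffix array" built with the library sort.
public class NaiveSuffixArray {

    // A suffix together with the index of the input string it came from.
    static final class Suffix {
        final int owner;
        final String text;

        Suffix(int owner, String text) {
            this.owner = owner;
            this.text = text;
        }
    }

    // Returns a longest substring that occurs in at least two different input strings.
    public static String longestShared(String[] strings) {
        List<Suffix> suffixes = new ArrayList<>();
        for (int i = 0; i < strings.length; i++) {
            for (int off = 0; off < strings[i].length(); off++) {
                suffixes.add(new Suffix(i, strings[i].substring(off)));
            }
        }
        // Naive construction: sort the suffixes lexicographically.
        suffixes.sort(Comparator.comparing((Suffix s) -> s.text));

        String best = "";
        for (int k = 1; k < suffixes.size(); k++) {
            Suffix a = suffixes.get(k - 1);
            Suffix b = suffixes.get(k);
            if (a.owner == b.owner) {
                continue; // a repeat inside one string does not count
            }
            int lcp = commonPrefixLength(a.text, b.text);
            if (lcp > best.length()) {
                best = a.text.substring(0, lcp);
            }
        }
        return best;
    }

    private static int commonPrefixLength(String x, String y) {
        int limit = Math.min(x.length(), y.length());
        int i = 0;
        while (i < limit && x.charAt(i) == y.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(longestShared(new String[] {"xabcdy", "zzabcdw", "abq"}));
    }
}
```

Storing each suffix as a separate String costs memory quadratic in the input length, and the comparison-based sort gets slow when many suffixes share long prefixes, which is the situation in which the more sophisticated construction algorithms pay off.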
