java - 实现带有子序列匹配的 sortedMap

Question

我想用键和值实现一个 sortedMap，以便可以通过提供一些子序列来搜索键。例如，地图包含 3 个条目：

abcd -> obj1
def -> obj2
abccd -> obj3

对于 query ac， result 应该是一个包含第 1 和第 3 个条目的子图，但对于 query acc，应该只返回第 3 个条目。

我应该在内部使用什么样的数据结构来有效地返回这样的子图？例如，Treemap哪个将键存储在树（trie）中以根据前缀有效地返回子图？

score 1 · Accepted Answer

To summarize what I've written in comments, I'd make it as following:

Create additional index HashMap with entries [2-char subsequence -> list of words].
On addition: generate all distinct 2-char subsequences from given word, add this word to every corresponding entry of index map.
On querying: generate all distinct 2-char subsequences from a query, among them find one which corresponds to the shortest list in index map, take this list. Filter it by full query and collect corresponding values from the main map.

If a query consists of one character, then perform full scan. I believe this would have better space/complexity than creating additional index for single chars, especially when 1-char queries are rare.

score 0 · Accepted Answer

You can try an aho-corasick with an wildcard. Aho-Corasick is the faster multiple pattern matching algorithm and with a wildcard it gives all substrings. You can try my php implementation at codeplex (https://phpahocorasick.codeplex.com). For example an unittest with wildcard for gene pattern:

$tree = new Ahocorasick\Ahocorasick();
$tree->add("AC");
$tree->add("GTG");
$tree->add("AACT");
echo $tree->match("ACCGAGTGCGTGGACAAACTACGATTGTGGAATGAACT","AC*GT");
$this->expectOutputString("ACCGAGT,ACCGAGTGCGT,ACCGAGTGCGTGGACAAACTACGATTGT,ACAAACTACGATTGT,ACTACGATTGT,ACGATTGT");

score 0 · Accepted Answer

注意：我的解决方案适用于查找子字符串。阅读编辑。

解决方案：

使用前缀数据结构：http ://en.wikipedia.org/wiki/Trie

您将需要存储：

abcd -> obj1

作为：

abcd -> obj1
bcd -> obj1
cd -> obj1
d -> obj1

为了找到你的结果，你将在 trie 中深入到满足条件和 DFS 从那里找到所有可能的解决方案。

例子：

ab -> obj1
cde -> obj1
bad -> obj2

我们将在 Trie 中插入下一个条目：

ab -> obj1
b -> obj1
cde -> obj1
de -> obj1
e -> obj1
bad -> obj2
ad -> obj2
d -> obj2

现在想象我们寻找下一个条目：

首先在 Trie 中向下移动字符 d。之后我们将进行 DFS。我们已经在 obj2 处，如果我们在 obj1 处移动字符 e。所以我们可以通过条目 d 得到它们。

cd

首先我们为字符 c 移动，然后为字符 d 移动。现在我们进行 DFS，我们能找到的唯一路径是添加字符 e，它导致 obj1。

efg

Trie 中没有满足条件的路径，因此我们没有解决方案。

编辑：

如果我们正在寻找原始元素的子字符串，我的解决方案就有效。要修改我的解决方案以使其适用于任何非传染性子字符串，您还必须在 Trie 中插入所有排列。

score 0 · Accepted Answer

/**
 * @author viborole
 *
 */
public class SortedMapSubsequence {

/**
 * @param args
 */
public static void main(String[] args) {
    Map<String, String> data = new HashMap<String,String>();
    data.put("abcd", "obj1");
    data.put("def", "obj2");
    data.put("abccd", "obj3");
    String query="acc";
    search(query, data);
}

private static void search(String query, Map<String, String> data) {
    for (Map.Entry<String, String> entry : data.entrySet())
    {
        String key=entry.getKey();
        char[] k=key.toCharArray();
        char[] q=query.toCharArray();
        int kpos=0;
        char[] found = new char[q.length];
        for(int i=0;i<q.length;i++){
            if(i>0){
                kpos++;
            }
            for(int j=kpos;j<k.length;j++){
                if(k[j]==q[i]){
                    kpos=j;
                    found[i]=k[j];
                    break;
                } 
            }
        }
        if(Arrays.equals(found,q)){
            System.out.println("found : "+ entry.getValue());
        }
    }
}

}

score 0 · Accepted Answer

您的“字符”可以解释为传统搜索引擎 AND 查询中的术语。

从这个评论：

abcd 的所有子序列，例如 a,b,c,d,ab, ac, ad, bc, bd,cd , abc, abd, acd, abcd 等，当被查询时，应该返回一个带有 abcd -> obj1 的子图它的条目之一。

我们可以解释为，对于带有 Aspid Beaver Cherokee Deer (ABCD) 术语的文档，在搜索其中的任何一个词或它们的任何组合时都应该返回该文档。

您要为此构建的是倒排索引。有关更多详细信息，请参阅此答案。本质上，您将为每个“字符”（=搜索引擎中的术语）构建一个或查找表（没有那么多字符），其中查找将返回它出现的所有位置HashMap（=搜索引擎中的文档-说话），理想情况下按升序排列。objX

要查找与一组字符相关联的所有对象，您将连接每个单独字符的集合。由于它们是有序的，因此您可以在线性时间内计算集合交集。

如果查询不ba应该匹配一个键，您可以优化查找表 /以存储字符位置（例如：存储“b 在 Obj1 中的位置 0”，“a 在 Obj1 中的位置 1”，而不是丢弃位置）。在计算搜索交集时，您将按搜索顺序继续并丢弃具有不正确相对顺序的匹配项。abcHashMap

这是搜索引擎的常规，并且已经在性能方面进行了详尽的分析。

编辑：性能

如果不同的“术语”的数量很少（例如：26 个小写英文字母），您可以通过添加 n-gram 作为术语来加快速度（对于“abc”，您可以添加“a”、“b” 、“c”、“ab”、“ac”、“bc”和“abc”）。相同的规则将适用 - 有一半的搜索，并且由于预期会出现重叠，因此可以使用 Sasha 的技巧来避免存储索引。

java - 实现带有子序列匹配的 sortedMap

5 回答 5

Related

Reference