java - Finding the indexes of a set of characters in a huge character sequence

Question

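Suppose I have a very large sequence of the characters A-D, 4 billion of them to be exact. My goal is to find the indexes of several new letter sequences, each of length 30, within that large sequence. The problem gets harder when the sequence you are looking for contains a small error (one letter is wrong). How should I approach this?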

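The simplest approach is to iterate over the whole 4-billion-character text file one letter at a time, but that would take forever, and I would run out of memory as well.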

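I was told to use a hash map, but I am not sure what exactly to use as the key-value pair. The idea of using a regular expression also came up, but I am not entirely sure it would solve my problem. Any help pointing me in the right direction would be appreciated. Thanks!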

Here is an illustration of what I am asking:

score 4 · Accepted Answer

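This is a classic problem called the longest common subsequence (LCS). There are many algorithms that solve it. Genome projects run this kind of search all the time. The wiki link provided has plenty of examples. Your error threshold would be a special case.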

Are you doing gene sequencing? I only ask because you mentioned just 4 variables :)

score 3 · Accepted Answer

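By storing each letter as a char, you waste 14 of every 16 bits, since only 2 bits carry information. You can encode four nucleotide letters in a single byte, and then 4 billion letters need only about one gigabyte (4 billion letters × 2 bits = 1 billion bytes). As for the algorithm, you can study the code in java.lang.String.indexOf and the Wikipedia page on the Boyer-Moore algorithm.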
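To make the Boyer-Moore pointer a little more concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character rule only, not the full algorithm), working on one byte per letter rather than on packed bits. The class and method names are made up for illustration. It finds exact matches only, so the one-wrong-letter case still needs separate handling, and with a 4-letter alphabet the skips are shorter than they would be on ordinary text.

```java
import java.util.Arrays;

final class HorspoolSearch {

    /**
     * Returns the first index >= from at which pattern occurs exactly in text,
     * or -1 if there is no such occurrence.
     */
    static int indexOf(byte[] text, byte[] pattern, int from) {
        int n = text.length;
        int m = pattern.length;
        int pos = Math.max(from, 0);
        if (m == 0) {
            return pos <= n ? pos : -1;
        }

        // Bad-character table: how far the window may slide, keyed by the
        // text byte aligned with the last pattern position.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        while (pos <= n - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos; // every position matched
            }
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

Calling indexOf again with from set to the previous hit plus one walks through all occurrences.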

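By the way, if you use a Lucene index for this, your searches will be instantaneous. The idea is to index every 30-letter subsequence as a separate document in Lucene. As for error tolerance, you would need to use N-grams, or run a fuzzy search (Lucene 4 has a new algorithm that quickly locates strings within an edit distance of up to 2 or 3).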
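The N-gram idea can also be tried without Lucene. The following is only a sketch using a plain in-memory HashMap, not Lucene's API, and it assumes that "one wrong letter" means a single substituted letter (no insertions or deletions). Under that assumption, if you cut the 30-letter query into two halves, at least one half must appear unchanged, so you can look up each half in an index of 15-letter substrings and verify only those candidates. SeedIndex and findWithOneMismatch are hypothetical names, and String keys are used for readability; for 4 billion letters the keys would have to be packed (15 letters fit in 30 bits) and the index would probably have to live on disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class SeedIndex {
    private final String text;
    private final int k; // seed (n-gram) length
    private final Map<String, List<Integer>> positions = new HashMap<>();

    SeedIndex(String text, int k) {
        this.text = text;
        this.k = k;
        // Key: every k-letter substring. Value: all start indexes where it occurs.
        for (int i = 0; i + k <= text.length(); i++) {
            positions.computeIfAbsent(text.substring(i, i + k), s -> new ArrayList<>()).add(i);
        }
    }

    /** Start indexes where pattern occurs with at most one substituted letter. */
    SortedSet<Integer> findWithOneMismatch(String pattern) {
        int len = pattern.length();
        if (len < 2 * k) {
            throw new IllegalArgumentException("pattern must hold two non-overlapping seeds");
        }
        SortedSet<Integer> hits = new TreeSet<>();
        // Two disjoint seeds: one wrong letter can damage at most one of them.
        int[] seedOffsets = {0, len - k};
        for (int off : seedOffsets) {
            String seed = pattern.substring(off, off + k);
            for (int p : positions.getOrDefault(seed, Collections.emptyList())) {
                int start = p - off; // where the whole pattern would begin
                if (start >= 0 && start + len <= text.length()
                        && mismatches(start, pattern) <= 1) {
                    hits.add(start);
                }
            }
        }
        return hits;
    }

    // Counts differing letters, giving up as soon as a second one is seen.
    private int mismatches(int start, String pattern) {
        int count = 0;
        for (int i = 0; i < pattern.length() && count <= 1; i++) {
            if (text.charAt(start + i) != pattern.charAt(i)) {
                count++;
            }
        }
        return count;
    }
}
```

For the question as asked you would build the index once with k = 15 and then query it with each 30-letter sequence. This also suggests one possible answer to the key-value question: the key is a short substring and the value is the list of positions where it occurs.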

score 1 · Accepted Answer

Here is some quick and simple code for handling the representation.

public class NucleotidePacking {

    public enum Nucleotide {
        A, B, C, D;
    }

    // Note: only the low 8 bits of each int are used (4 nucleotides per element).
    // A byte[] would store the same data in a quarter of the space.

    // Set or clear one bit; pos 0 is the most significant of the 8 bits used.
    public static int setbit(int val, int pos, boolean on) {
        if (on) {
            // set bit
            return val | (1 << (8 - pos - 1));
        } else {
            // unset bit
            return val & ~(1 << (8 - pos - 1));
        }
    }

    public static int set2bits(int val, int pos, int bits) {
        // set/unset the first (high) bit of the pair
        val = setbit(val, pos, (bits & 2) > 0);
        // set/unset the second (low) bit of the pair
        val = setbit(val, pos + 1, (bits & 1) > 0);
        return val;
    }

    public static int setNucleotide(int sequence, int pos, Nucleotide tide) {
        // set both bits based on the ordinal position in the enum
        return set2bits(sequence, pos * 2, tide.ordinal());
    }

    public static void setNucleotide(int[] sequence, int pos, Nucleotide tide) {
        // figure out which element in the array to work with
        int intpos = pos / 4;
        // figure out which of the 4 bit pairs to work with
        int bitpos = pos % 4;
        sequence[intpos] = setNucleotide(sequence[intpos], bitpos, tide);
    }

    public static Nucleotide getNucleotide(int[] sequence, int pos) {
        int intpos = pos / 4;
        int bitpos = pos % 4;
        int val = sequence[intpos];
        // get the bits for the requested one, and shift them
        // down into the least significant bits so we can
        // convert back to the enum
        int shift = 8 - (bitpos + 1) * 2;
        int tide = (val & (3 << shift)) >> shift;
        return Nucleotide.values()[tide];
    }

    public static void main(String[] args) {
        int[] sequence = new int[4]; // room for 16 nucleotides
        setNucleotide(sequence, 4, Nucleotide.C);
        System.out.println(getNucleotide(sequence, 4)); // prints C
    }
}

Obviously there is a lot of bit shifting going on, but the few comments should make it clear what is happening.

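Of course, the downside of this representation is that you work in groups of 4. If you want to store, say, 10 nucleotides, you have to keep the count in another variable so that you know the last 2 slots in the sequence do not hold useful data.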

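If nothing else, the fuzzy matching can be done by brute force. You take an input sequence of N nucleotides, then, starting at 0, compare it against nucleotides 0:N-1 and count how many match. Then you move on to 1:N, then 2:N+1, and so on.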
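A minimal sketch of that brute-force scan, written against plain CharSequence values for readability instead of the packed representation above. The class name and the maxMismatches parameter are illustrative assumptions; passing 1 corresponds to the "one wrong letter" case from the question.

```java
import java.util.ArrayList;
import java.util.List;

final class BruteForceFuzzyMatch {

    /**
     * Returns every start index at which pattern lines up with text
     * with at most maxMismatches differing letters.
     */
    static List<Integer> find(CharSequence text, CharSequence pattern, int maxMismatches) {
        List<Integer> hits = new ArrayList<>();
        int n = pattern.length();
        // Slide the window: 0..N-1, then 1..N, then 2..N+1, and so on.
        for (int start = 0; start + n <= text.length(); start++) {
            int mismatches = 0;
            // Stop comparing this window as soon as the budget is exceeded.
            for (int i = 0; i < n && mismatches <= maxMismatches; i++) {
                if (text.charAt(start + i) != pattern.charAt(i)) {
                    mismatches++;
                }
            }
            if (mismatches <= maxMismatches) {
                hits.add(start);
            }
        }
        return hits;
    }
}
```

In the worst case this does on the order of text length times N comparisons, but the early exit means most windows are rejected after a few letters.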
