java - Java字符串搜索期间突然变慢和java.lang.OutOfMemoryError

Question

我正在编写一个用于在 RNA 序列中发现模式的程序，该程序最有效。为了在序列中找到“模式”，我正在生成一些可能的模式并扫描它们的所有序列的输入文件（算法还有更多内容，但这是正在破坏的位）。生成的可能模式具有用户给定的指定长度。

这适用于最长 8 个字符的所有序列长度。然后在9点，程序运行了很长时间，然后报了java.lang.OutOfMemoryError。经过一番调试，我发现弱点是模式生成方法：

/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {

ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field

/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
   alphabet and the whole length of the string is equal to the input parameter
   'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/

for (int i = 1; i < length-1; i++) { // for each potential position of C2

    // for each alphabet character to be C1
    for (int first = 0; first < alphabet.length; first++) { 

    // for each alphabet character to be C3
    for (int last = 0; last < alphabet.length; last++) {

        // make blank pattern, i.e. no C2
        Substring s_blank = new Substring(-1, alphabet[first],
                          '0', alphabet[last]);

        // get its frequency in the input string
        s_blank.occurrences = search_sequences(s_blank.toString());

        // if blank ep is found frequently enough in the input string, store it
        if (s_blank.frequency()>=nP) ep_subs.add(s_blank);

        // when C2 is present, for each character it could be
        for (int mid = 0; mid < alphabet.length; mid++) {

        // make pattern C1,C2,C3
        Substring s = new Substring(i, alphabet[first],
                        alphabet[mid],
                        alphabet[last]);

        // search input string for pattern s
        s.occurrences = search_sequences(s.toString());

        // if s is frequent enough, store it
        if (s.frequency()>=nP) ep_subs.add(s);
        }
    }
    }
}
}

以下是发生的情况：当我对 search_sequences 的调用进行计时时，它们从每个大约 40-100 毫秒开始，并在第一个模式中继续这种方式。然后在几百个模式（围绕'C.....GC'）之后，这些调用突然开始花费大约十倍的时间，1000-2000ms。之后，时间稳步增加，直到大约 12000 毫秒（'C......TA'）它给出了这个错误：

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
    at java.util.Scanner.match(Scanner.java:1270)
    at java.util.Scanner.hasNextLine(Scanner.java:1478)
    at PatternFinder4.search_sequences(PatternFinder4.java:217)
    at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
    at PatternFinder4.main(PatternFinder4.java:62)

这是 search_sequences 方法：

/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {

/* arraylist returned holding int arrays with coordinates of the places where 'sub'
 was found, i.e. {l,i} l = lines number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;

String line = "";
while (s.hasNextLine()) {
    line = s.nextLine();
    pattern = Pattern.compile(sub);
    matcher = pattern.matcher(line);
    pattern = null; // all the =nulls were intended to help memory management, had no effect

    int index = 0;

    // for each occurrence of 'sub' in the line being scanned
    while (matcher.find(index)) {
    int start = matcher.start(); // get the index of the next occurrence
    int[] occurrence = {line_index, start}; // make up the coordinate array
    occurrences.add(occurrence); // store that occurrence
    index = start+1; // start looking from after the last occurence found
    }
    matcher=null;
    line=null;
    line_index++;

}
s=null;

return occurrences;
}

我已经在几台不同速度的计算机上尝试了该程序，虽然在更快的计算机上完成 search_sequence 的实际时间更小，但相对时间是相同的；在大约相同的迭代次数下，search_sequence 开始需要十倍的时间才能完成。

我试过用谷歌搜索不同输入流（如 BufferedReader 等）的内存效率和速度，但普遍的共识似乎是它们都大致相当于 Scanner。你们中的任何人对这个错误是什么或我如何尝试自己解决这个问题有什么建议吗？

如果有人想查看更多代码，请询问。

编辑：

1 - 输入文件“序列”是 1000 个蛋白质序列（每个在一行上），长度不同，大约几百个字符。我还应该提到这个程序将/只需要工作/最多长度为九的模式。

2 - 这是上述代码中使用的 Substring 类方法

static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring

public Substring(int inresidue, char infront, char inmid, char inend) {
    occurrences = new ArrayList<int[]>();
    residue = inresidue;
    front = infront;
    mid = inmid;
    end = inend;
    setString(); // makes the string representation using characters and their positions
}

/* gets the frequency of the substring given the places it occurs in 'sequences'. 
   This only counts the substring /once per line ist occurs in/. */
public int frequency() {
    return PatternFinder.frequency(occurrences);
}

public String toString() {
    return string;
}

/* makes the string representation using the substring's characters and their positions */
private void setString() {
    if (residue>-1) {
    String left_mid = "";
    for (int j = 0; j < residue-1; j++) left_mid += ".";
    String right_mid = "";
    for (int j = residue+1; j < length-1; j++) right_mid += ".";
    string = front + left_mid + mid + right_mid + end;
    } else {
    String mid = "";
    for (int i = 0; i < length-2; i++) mid += ".";
    string = front + mid + end;
    }
}
 }

...和 PatternFinder.frequency 方法（在 Substring.frequency() 中调用）：

public static int frequency(ArrayList<int[]> occurrences) {
    HashSet<String> lines_present = new HashSet<String>();
    for (int[] occurrence : occurrences) {
        lines_present.add(new String(occurrence[0]+""));
    }
    return lines_present.size();
    }

score 0 · Accepted Answer

什么是字母？你给它什么样的正则表达式？您是否检查过要存储的出现次数？有可能仅仅存储事件就足以使其内存不足，因为您正在进行指数级的搜索。

听起来您的算法具有隐藏的指数资源使用情况。你需要重新考虑你想要做什么。

此外，将局部变量设置为 null 也无济于事，因为 JVM 已经进行了数据流和活跃度分析。

编辑：这是一个页面，它解释了即使是短的正则表达式也可能需要指数级的时间来运行。

score 0 · Accepted Answer

我无法发现明显的内存泄漏，但您的程序确实有许多低效之处。以下是一些建议：

正确缩进你的代码。它将使您和其他人的阅读变得更加容易。目前的形式很难阅读。
如果您指的是成员变量，请在其前面加上this.，否则代码片段的读者将无法确定您指的是什么。
除非绝对必要，否则避免使用静态成员和方法。Classname.membername出于同样的原因，在提及它们时，请使用表格。
的代码frequency()与 just 有何不同return occurrences.size()？
在search_sequences()中，正则表达式字符串sub是一个常数。您只需要编译一次，但您要为每一行重新编译它。
将输入字符串 ( sequences) 拆分为几行，然后将它们存储在数组或ArrayList. 不要在内部重新拆分search_sequences()，将拆分集合传入。

可能还有更多需要解决的问题，但这是跳出来的列表。

解决所有这些问题，如果您仍然有问题，您可能需要使用分析器来找出正在发生的事情。

java - Java字符串搜索期间突然变慢和java.lang.OutOfMemoryError

2 回答 2

Related

Reference