我正在编写一个用于在 RNA 序列中发现模式的程序,该程序最有效。为了在序列中找到“模式”,我正在生成一些可能的模式并扫描它们的所有序列的输入文件(算法还有更多内容,但这是正在破坏的位)。生成的可能模式具有用户给定的指定长度。
这适用于最长 8 个字符的所有序列长度。然后在9点,程序运行了很长时间,然后报了java.lang.OutOfMemoryError。经过一番调试,我发现弱点是模式生成方法:
/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {
ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field
/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
alphabet and the whole length of the string is equal to the input parameter
'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/
for (int i = 1; i < length-1; i++) { // for each potential position of C2
// for each alphabet character to be C1
for (int first = 0; first < alphabet.length; first++) {
// for each alphabet character to be C3
for (int last = 0; last < alphabet.length; last++) {
// make blank pattern, i.e. no C2
Substring s_blank = new Substring(-1, alphabet[first],
'0', alphabet[last]);
// get its frequency in the input string
s_blank.occurrences = search_sequences(s_blank.toString());
// if blank ep is found frequently enough in the input string, store it
if (s_blank.frequency()>=nP) ep_subs.add(s_blank);
// when C2 is present, for each character it could be
for (int mid = 0; mid < alphabet.length; mid++) {
// make pattern C1,C2,C3
Substring s = new Substring(i, alphabet[first],
alphabet[mid],
alphabet[last]);
// search input string for pattern s
s.occurrences = search_sequences(s.toString());
// if s is frequent enough, store it
if (s.frequency()>=nP) ep_subs.add(s);
}
}
}
}
}
以下是发生的情况:当我对 search_sequences 的调用进行计时时,它们从每个大约 40-100 毫秒开始,并在第一个模式中继续这种方式。然后在几百个模式(围绕'C.....GC')之后,这些调用突然开始花费大约十倍的时间,1000-2000ms。之后,时间稳步增加,直到大约 12000 毫秒('C......TA')它给出了这个错误:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
at java.util.Scanner.match(Scanner.java:1270)
at java.util.Scanner.hasNextLine(Scanner.java:1478)
at PatternFinder4.search_sequences(PatternFinder4.java:217)
at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
at PatternFinder4.main(PatternFinder4.java:62)
这是 search_sequences 方法:
/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {
/* arraylist returned holding int arrays with coordinates of the places where 'sub'
was found, i.e. {l,i} l = lines number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;
String line = "";
while (s.hasNextLine()) {
line = s.nextLine();
pattern = Pattern.compile(sub);
matcher = pattern.matcher(line);
pattern = null; // all the =nulls were intended to help memory management, had no effect
int index = 0;
// for each occurrence of 'sub' in the line being scanned
while (matcher.find(index)) {
int start = matcher.start(); // get the index of the next occurrence
int[] occurrence = {line_index, start}; // make up the coordinate array
occurrences.add(occurrence); // store that occurrence
index = start+1; // start looking from after the last occurence found
}
matcher=null;
line=null;
line_index++;
}
s=null;
return occurrences;
}
我已经在几台不同速度的计算机上尝试了该程序,虽然在更快的计算机上完成 search_sequence 的实际时间更小,但相对时间是相同的;在大约相同的迭代次数下,search_sequence 开始需要十倍的时间才能完成。
我试过用谷歌搜索不同输入流(如 BufferedReader 等)的内存效率和速度,但普遍的共识似乎是它们都大致相当于 Scanner。你们中的任何人对这个错误是什么或我如何尝试自己解决这个问题有什么建议吗?
如果有人想查看更多代码,请询问。
编辑:
1 - 输入文件“序列”是 1000 个蛋白质序列(每个在一行上),长度不同,大约几百个字符。我还应该提到这个程序将/只需要工作/最多长度为九的模式。
2 - 这是上述代码中使用的 Substring 类方法
static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring
public Substring(int inresidue, char infront, char inmid, char inend) {
occurrences = new ArrayList<int[]>();
residue = inresidue;
front = infront;
mid = inmid;
end = inend;
setString(); // makes the string representation using characters and their positions
}
/* gets the frequency of the substring given the places it occurs in 'sequences'.
This only counts the substring /once per line ist occurs in/. */
public int frequency() {
return PatternFinder.frequency(occurrences);
}
public String toString() {
return string;
}
/* makes the string representation using the substring's characters and their positions */
private void setString() {
if (residue>-1) {
String left_mid = "";
for (int j = 0; j < residue-1; j++) left_mid += ".";
String right_mid = "";
for (int j = residue+1; j < length-1; j++) right_mid += ".";
string = front + left_mid + mid + right_mid + end;
} else {
String mid = "";
for (int i = 0; i < length-2; i++) mid += ".";
string = front + mid + end;
}
}
}
...和 PatternFinder.frequency 方法(在 Substring.frequency() 中调用):
public static int frequency(ArrayList<int[]> occurrences) {
HashSet<String> lines_present = new HashSet<String>();
for (int[] occurrence : occurrences) {
lines_present.add(new String(occurrence[0]+""));
}
return lines_present.size();
}