-2

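Hi everyone, I'm trying to read a genome sequence and search for any 10-character sequences that repeat. The solution I have in mind has five steps:

  1. Read the genome sequence, for example: GAAAAATTTTCCCCCACCCTTTTCCCC
  2. Cut the string into consecutive ten-character sequences, e.g. the first new string covers indices 0-9, the next 1-10, then 2-11, 3-12...
  3. Store these sequences in an ArrayList
  4. Compare the strings
  5. Return the repeated sequences and how often each one repeats.

Where I'm stuck is how to generate a new string from the old, larger one. Say my genome sequence is AAAAGGGGGAAAATTTCCCC; then my first ten-character sequence would be AAAAGGGGGA and the next would be AAAGGGGGAA. How would I do this in Java?

Here is what I have so far: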

import java.util.List;
import java.util.ArrayList;

public class Solution
{
    public ArrayList<String> findRepeatedDnaSequences(String s) 
    {
        ArrayList<String> sequence = new ArrayList<String>();
        int matches;
        ArrayList<String> matchedSequence = new ArrayList<String>();
        for(int i = 0; i < s.length(); i++)
        {
            if (i + 9 > s.length())
            {
                sequence.add(s.substring(i, i + 9));
            }

        }
        for(int i = 0; i < sequence.size(); i++)
        {
            matches = 0;
            for (int j = 1; j < sequence.size(); j++)
            {
                if(sequence.get(i) == sequence.get(i))
                {
                    matches++;
                    System.out.print(matches);
                    matchedSequence.add(sequence.get(i));
                }
            }
        }
        return matchedSequence;
    }
}
4

3 Answers

0
The following is a complete class that does what you are looking for. The code is fairly self-explanatory.
package source;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PatternFinding {

    // Collects every window of length len that consists of one repeated character (a run such as "AAAAA")
    public static List<String> stringMatcher(String str,int len){
        String string="";
        int count=1;
        List<String> list=new ArrayList<String>();
        for(int i=0;i+len<=str.length();i++){
            System.out.print(i);
            string="";
            count=1;
            char ch=str.charAt(i);
            string+=String.valueOf(ch);
            for(int j=i+1;j<str.length() && j<i+len;j++){
                System.out.println(" "+j);
                if(ch==str.charAt(j)){
                    count++;
                    string+=String.valueOf(str.charAt(j));
                }else{
                    break;
                }
            }
            System.out.println(string);
            if(count==len){
                list.add(string);
            }
        }
        return list;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br=new BufferedReader(new InputStreamReader(System.in));
        String text=br.readLine();
        // pass the length of your pattern as the second argument
        List<String> list=stringMatcher(text,5);

        //sorting the list
        Collections.sort(list);
        for(int i=0;i<list.size();i++){
            System.out.println(list.get(i));
        }

        // counting occurrences (the list is sorted, so equal strings are adjacent)
        for(int i=0;i<list.size();){
            String str=list.get(i);
            int lastIndex=list.lastIndexOf(str);
            System.out.println(str+" happens "+ (lastIndex-i+1)+" times");
            i=lastIndex+1;
        }

    }
}
answered 2016-02-16T00:10:17.020
0
public class MainClass {

    public static void main(String[] args){
        printAllSequences("GAAAAATTTTCCCCCACCCTTTTCCCC", 10);
    }

    public static void printAllSequences(String DNASequence, int subSequenceSize){
        // the last valid start index is length - subSequenceSize, so the bound is inclusive
        for(int i=0; i<=DNASequence.length() - subSequenceSize; i++){
            System.out.println(DNASequence.substring(i, i + subSequenceSize));
        }
    }

}
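
printAllSequences only prints the windows. To cover the remaining steps from the question (store, compare, count), one possible sketch is below. The class name RepeatCounter and the method findRepeats are made up for illustration, and a map is swapped in for the ArrayList the question mentions, since a map can keep a count per window directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the answer above.
class RepeatCounter {

    // Counts every window of the given size, then keeps only those seen more than once.
    static Map<String, Integer> findRepeats(String dna, int size) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // The last valid start index is dna.length() - size, hence "<=".
        for (int i = 0; i + size <= dna.length(); i++) {
            counts.merge(dna.substring(i, i + size), 1, Integer::sum);
        }
        // Drop windows that occur only once.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) {
        // prints {AAAAACCCCC=2}
        System.out.println(findRepeats("AAAAACCCCCAAAAACCCCC", 10));
    }
}
```

Because the map is keyed by the window text, strings are compared by content, which avoids the == comparison on String objects used in the question.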
answered 2016-02-15T22:55:31.710
0

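If you are using Java 8, you can do this with streams. Unfortunately, the Stream API lacks many methods that exist in other programming languages, but we can still implement them ourselves. So, using the sliding method from this answer:

How to convert a stream of strings into a stream of string pairs?

you can do this: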

String gseq = "AAAAACCCCCAAAAACCCCC";

Map<String, Long> count = StreamUtils.sliding(10, gseq.chars().boxed())
        .map(l -> new String(l.stream().mapToInt(n -> n).toArray(), 0, l.size()))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

This produces a map containing the count of each substring of length 10.
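
StreamUtils.sliding comes from the linked answer and is not part of the JDK. If you would rather not copy that helper, a similar map can probably be built with only standard classes by streaming over the start indices. This is a rough sketch; the class name SlidingCount and the method countWindows are invented for the example.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical alternative that avoids the StreamUtils helper.
class SlidingCount {

    // Maps each window of the given size to the number of times it occurs.
    static Map<String, Long> countWindows(String seq, int size) {
        // rangeClosed yields an empty stream when seq is shorter than size.
        return IntStream.rangeClosed(0, seq.length() - size)
                .mapToObj(i -> seq.substring(i, i + size))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> count = countWindows("AAAAACCCCCAAAAACCCCC", 10);
        // Print only the windows that occur more than once.
        count.forEach((k, v) -> { if (v > 1) System.out.println(k + " -> " + v); });
    }
}
```

As with the snippet above, the map also holds windows that occur only once, so filter on a count greater than 1 if only repeats are wanted.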

answered 2016-02-15T23:20:12.823