java - Parsing one text file into multiple text files

Question

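I want to parse an input file with Java to produce multiple files. The input file contains thousands of protein sequences in FASTA format, and I want to generate the raw format of each protein sequence (that is, without any commas or semicolons, and without any extra symbols such as ">", "[", "]", etc.).

A FASTA sequence starts with a ">" symbol, followed by the protein description, and then the protein sequence.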

For example: >lcl|NC_000001.10_cdsid_XP_003403591.1 [gene=LOC100652771] [protein=hypothetical protein LOC100652771] [protein_id=XP_003403591.1] [location=join(12190..12227,12595..12721,13403). ] MSESINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCL PSFTRSRPLPDRQLHHCLALCPSFALPAGDGVCHGPGLQGSCYKGETQESVESRVLPGPRHRH

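The input file contains more than 1000 protein sequences in the format shown above. I have to generate thousands of raw files, each containing only a single protein sequence, without any special symbols or gaps.

I have written Java code for this, but the output is "cannot open a file" followed by "cannot find file".

Please help me solve my problem.

Regards, Vijay Kumar Garg, Varanasi, Bharat (India)

The code is: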

/*Java code to convert FASTA format to a raw format*/
import java.io.*;
import java.util.*;
import java.util.regex.*;
import java.io.FileInputStream;

// java package for using regular expression
public class Arrayren
{
    public static void main(String args[]) throws IOException  
    {
        String a[]=new String[1000];
        String b[][] =new String[1000][1000];
        /*open the id file*/
        try
        {
            File f = new File ("input.txt"); 
            //opening the text document containing genbank ids
            FileInputStream fis = new FileInputStream("input.txt");
            //Reading the file contents through inputstream
            BufferedInputStream bis = new BufferedInputStream(fis);
            // Writing the contents to a buffered stream
            DataInputStream dis = new DataInputStream(bis);
            //Method for reading Java Standard data types
            String inputline;
            String line;
            String separator = System.getProperty("line.separator");
            // reads a line till next line operator is found
            int i=0;
            while ((inputline=dis.readLine()) != null) 
            {
                i++;
                a[i]=inputline;
                a[i]=a[i].replaceAll(separator,"");
                //replaces unwanted patterns like /n with space
                a[i]=a[i].trim();
                // trims out if any space is available
                a[i]=a[i]+".txt";
                //takes the file name into an array
                try
                // to handle run time error
                /*take the sequence in to an array*/
                {
                    BufferedReader in = new BufferedReader (new FileReader(a[i]));
                    String inline = null;
                    int j=0;
                    while((inline=in.readLine()) != null)
                    {
                        j++;
                        b[i][j]=inline;
                        Pattern q=Pattern.compile(">");
                        //Compiling the regular expression
                        Matcher n=q.matcher(inline);
                        //creates the matcher for the above pattern
                        if(n.find())
                        {
                            /*appending the comment line*/
                            b[i][j]=b[i][j].replaceAll(">gi","");
                            //identify the pattern and replace it with a space
                            b[i][j]=b[i][j].replaceAll("[a-zA-Z]","");
                            b[i][j]=b[i][j].replaceAll("|","");
                            b[i][j]=b[i][j].replaceAll("\\d{1,15}","");
                            b[i][j]=b[i][j].replaceAll(".","");
                            b[i][j]=b[i][j].replaceAll("_","");
                            b[i][j]=b[i][j].replaceAll("\\(","");
                            b[i][j]=b[i][j].replaceAll("\\)","");
                        }
                        /*printing the sequence in to a text file*/
                        b[i][j]=b[i][j].replaceAll(separator,"");
                        b[i][j]=b[i][j].trim();
                        // trims out if any space is available
                        File create = new File(inputline+"R.txt");
                        try
                        {
                            if(!create.exists())
                            {
                                create.createNewFile();
                                // creates a new file
                            }
                            else
                            {
                                System.out.println("file already exists");
                            }
                        }
                        catch(IOException e)
                        // to catch the exception and print the error if cannot open a file
                        {
                            System.err.println("cannot create a file");
                        }
                        BufferedWriter outt = new BufferedWriter(new FileWriter(inputline+"R.txt", true));
                        outt.write(b[i][j]);
                        // printing the contents to a text file
                        outt.close();
                        // closing the text file
                        System.out.println(b[i][j]);
                    }
                }
                catch(Exception e)
                {
                    System.out.println("cannot open a file");
                }
            }
        }
        catch(Exception ex)
        // catch the exception and prints the error if cannot find file
        {
            System.out.println("cannot find file ");
        }
    }
}

It would be easier to understand if you provided proper information.

score 0 · Accepted Answer

Your code contains the following two catch blocks:

    catch(Exception e)
    {
        System.out.println("cannot open a file");
    }

    catch(Exception ex)
    // catch the exception and prints the error if cannot find file
    {
        System.out.println("cannot find file ");
    }

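Both of these swallow the exception and print a generic "it didn't work" message, which tells you that the catch block was entered but nothing more.

Exceptions usually contain useful information that can help you find out where the real problem is. By ignoring them, you make your problem much harder to diagnose. Worse still, you are catching Exception, which is the superclass of a great many exceptions, so these catch blocks catch many different types of exception and ignore all of them.

The simplest way to get information out of an exception is to call its printStackTrace() method, which prints the exception type, the exception message, and the stack trace. Add a call to this method in both catch blocks, and you will see much more clearly which exception is being thrown and where it is thrown from.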
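A minimal sketch of this advice, assuming a hypothetical helper class `StackTraceDemo` with a made-up method `readFirstLine` (neither appears in the question). It catches the specific exception types rather than `Exception` and prints the stack trace instead of a fixed message:

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical demo class; the names are illustrative and not taken from the question.
public class StackTraceDemo {

    // Returns the first line of the file, or null if the file is empty
    // or could not be read.
    static String readFirstLine(String fileName) {
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            return in.readLine();
        } catch (FileNotFoundException e) {
            // Prints the exception type, its message and the stack trace to System.err.
            e.printStackTrace();
            return null;
        } catch (IOException e) {
            // Any other I/O failure is reported separately, again with full details.
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed file name, matching the one used in the question.
        String first = readFirstLine("input.txt");
        System.out.println("first line: " + first);
    }
}
```

With a missing file, the trace names the exception class and the line that tried to open the file, which is the information the two generic messages hide.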

score 0 · Accepted Answer

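This code will not win any prizes, owing to a lack of Java expertise. For example, I would expect an OutOfMemoryError even if the code were otherwise correct. A rewrite would be best. Still, we all start small.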

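Give the full path of the file. Likewise for the output files: the directory may be missing from the path.
Better to use BufferedReader etc. instead of DataInputStream.
Initialize i with -1. Better still, use for (int i = 0; i < a.length; ++i).
Better to compile the Pattern outside the loop. Better yet, drop the Matcher altogether; you can simply use if (s.contains(">")). There is no need to create the new file beforehand, since FileWriter creates it if it does not exist.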
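The tip about the Pattern and Matcher can be sketched as follows. The class `HeaderCheckDemo` and its method names are hypothetical; the sample lines are shortened from the example in the question. Both variants detect a FASTA header line, but the second needs no regular expression at all:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
public class HeaderCheckDemo {

    // Compiled once and reused, instead of being recompiled for every line.
    private static final Pattern HEADER = Pattern.compile(">");

    // Regex variant: true if the line contains a '>' anywhere.
    static boolean isHeaderWithRegex(String line) {
        return HEADER.matcher(line).find();
    }

    // Plain-string variant with the same result and no Matcher object.
    static boolean isHeaderWithContains(String line) {
        return line.contains(">");
    }

    public static void main(String[] args) {
        String[] lines = {
            ">lcl|NC_000001.10_cdsid_XP_003403591.1 [gene=LOC100652771]",
            "MSESINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCL"
        };
        for (int i = 0; i < lines.length; ++i) {
            System.out.println(i + ": regex=" + isHeaderWithRegex(lines[i])
                    + ", contains=" + isHeaderWithContains(lines[i]));
        }
    }
}
```

Since a FASTA header begins with ">", `line.startsWith(">")` would be a slightly stricter check than `contains`.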

Code:

final String encoding = "Windows-1252"; // Or "UTF-8"; omit it to use the platform default.
File f = new File("C:/input.txt");
BufferedReader dis = new BufferedReader(new InputStreamReader(
    new FileInputStream(f), encoding));

...

        int i= -1; // So i++ starts with 0.
        while ((inputline=dis.readLine()) != null) 
        {
            i++;
            a[i]=inputline.trim();
            // readLine() already strips the line terminator, so this is not needed:
            // a[i]=a[i].replaceAll(separator,"");
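For reference, here is one possible shape for a full rewrite. It is a sketch under stated assumptions, not the answerer's code. It assumes standard FASTA layout (each header on its own line starting with ">", followed by one or more sequence lines), UTF-8 input, and made-up output names of the form `raw/sequence_N.txt`. The class `FastaSplitter` and its method `extractSequences` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical class: the names and the output file naming are assumptions.
public class FastaSplitter {

    // Returns one raw sequence per FASTA record. Header lines (starting with '>')
    // are dropped, and whitespace inside sequence lines is removed.
    static List<String> extractSequences(Iterator<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null; // stays null until the first header is seen
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.startsWith(">")) {
                if (current != null) {
                    sequences.add(current.toString()); // close the previous record
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line.replaceAll("\\s+", ""));
            }
        }
        if (current != null) {
            sequences.add(current.toString()); // close the last record
        }
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: "input.txt" as in the question, "raw" as output directory.
        Path input = Paths.get(args.length > 0 ? args[0] : "input.txt");
        Path outputDir = Paths.get(args.length > 1 ? args[1] : "raw");
        Files.createDirectories(outputDir);

        List<String> sequences;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            sequences = extractSequences(reader.lines().iterator());
        }
        // One output file per sequence; Files.write creates each file as needed.
        for (int i = 0; i < sequences.size(); ++i) {
            Path out = outputDir.resolve("sequence_" + (i + 1) + ".txt");
            Files.write(out, sequences.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

This version keeps all sequences in a list, which should be fine for a few thousand proteins. For much larger inputs, each record could be written out as soon as the next header is reached. Letting `main` declare `throws IOException` means any failure surfaces with its full stack trace instead of a generic message.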
