I have a large FASTA file containing thousands of protein sequences, and I want to split it into multiple files.
I am using ActivePerl for my project.
You can do this easily with awk instead of perl. This writes each sequence to its own file, named after its header line (the leading ">" is stripped so the name is a valid filename, and the previous file is closed before opening the next):

awk '/^>/{close(file); file=substr($0,2)".txt"}{print > file}' your_fasta_file
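If instead you want a fixed number of sequences per output file, a batched variant of the same idea is sketched below. The file names (demo.fasta, chunk_N.fasta) and the batch size n are illustrative assumptions, not part of the original one-liner:

```shell
# Demo input (replace with your real FASTA file):
printf '>seq1\nMKV\n>seq2\nGHA\n>seq3\nLLP\n' > demo.fasta

# Start a new output file every n sequences (chunk_1.fasta, chunk_2.fasta, ...):
awk -v n=2 '/^>/ { if (c++ % n == 0) { close(f); f = "chunk_" (++i) ".fasta" } } { print > f }' demo.fasta
```

With n=2, seq1 and seq2 land in chunk_1.fasta and seq3 in chunk_2.fasta.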
How many sequences do you need per file?
You could do something like this:
#!/usr/bin/perl -w
use strict;

my $fasta_file = "something.fasta";
my $seqs_per_file = 100;   # whatever your batch size is
my $file_number = 1;       # output files will be named like "something.fasta.1"
my $seq_ctr = 0;

open(FASTA, $fasta_file) || die("can't open $fasta_file: $!");
while (<FASTA>) {
    if (/^>/) {
        # open a new file once we've printed enough sequences to this one
        if ($seq_ctr++ % $seqs_per_file == 0) {
            close(OUT) if $seq_ctr > 1;
            open(OUT, ">", $fasta_file . "." . $file_number++)
                || die("can't open output file: $!");
        }
    }
    print OUT $_;
}
close(OUT);
close(FASTA);
This code is in Java. I don't mind if the moderators remove it from here, but it might help. :)
/**
 * This tool chops a FASTA file into several parts based on the number of
 * sequences required in one file.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.lang3.StringUtils;

/**
 * @author Arpit
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);

        // Prefix each header with a marker so we can split on record
        // boundaries (a clever cheat with help from stackoverflow).
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");

        int startIndex = 0;
        int stopIndex = 0;
        for (int j = 0; j < numOfFiles; j++) {
            FileWriter fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"));
            if (j == (numOfFiles - 1)) {
                stopIndex = splitLines.length; // last file takes the remainder
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            for (int i = startIndex; i < stopIndex; i++) {
                fw.write(splitLines[i]);
            }
            startIndex = stopIndex;
            fw.close();
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I know you said you wanted this in Perl, but I have done this many times with Python and BioPython, which I believe is comparable to BioPerl (but better :).
import sys
from Bio import SeqIO

def write_file(input_file, split_number):
    # base name of the fasta file, used to build the output file names
    parent_file_base_name = input_file.split(".")[0]
    counter = 1
    # our first output file name
    file = parent_file_base_name + "_" + str(counter) + ".fasta"
    # holds the records to be written to the current file
    joiner = []
    # enumerate the huge fasta
    for num, record in enumerate(SeqIO.parse(input_file, "fasta"), start=1):
        # append the record to our list holder
        joiner.append(">" + record.id + "\n" + str(record.seq))
        # once we have reached the maximum number of records for one file,
        # write them out and clear the record holder
        if num % split_number == 0:
            joiner.append("")
            with open(file, 'w') as f:
                f.write("\n".join(joiner))
            # change the file name, clear the record holder, bump the file count
            counter += 1
            file = parent_file_base_name + "_" + str(counter) + ".fasta"
            joiner = []
    # write out whatever records are left over
    if joiner:
        joiner.append("")
        with open(file, 'w') as f:
            f.write("\n".join(joiner))

if __name__ == "__main__":
    input_file = sys.argv[1]
    split_number = int(sys.argv[2])
    write_file(input_file, split_number)
    print("fasta_splitter.py is finished.")
Just run it with
python script.py parent_fasta.fasta <how many records per file>