I have a large FASTA file containing thousands of protein sequences, and I want to split it into multiple files.
I am using ActivePerl for my project.
You can do this easily with awk instead of perl. This writes each sequence to its own file, named after its header line (the leading ">" is stripped so the name is a valid filename, and the previous file is closed before opening the next):

awk '/^>/{close(file); file=substr($0,2)".txt"}{print > file}' your_fasta_file
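If instead you want a fixed number of sequences per output file, a batched variant of the same idea is sketched below. The file names (demo.fasta, chunk_N.fasta) and the batch size n are illustrative assumptions, not part of the original one-liner:

```shell
# Demo input (replace with your real FASTA file):
printf '>seq1\nMKV\n>seq2\nGHA\n>seq3\nLLP\n' > demo.fasta

# Start a new output file every n sequences (chunk_1.fasta, chunk_2.fasta, ...):
awk -v n=2 '/^>/ { if (c++ % n == 0) { close(f); f = "chunk_" (++i) ".fasta" } } { print > f }' demo.fasta
```

With n=2, seq1 and seq2 land in chunk_1.fasta and seq3 in chunk_2.fasta.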
How many sequences do you need per file?
You could do something like this:
#!/usr/bin/perl -w
use strict;

my $fasta_file = "something.fasta";
my $seqs_per_file = 100;   # whatever your batch size is
my $file_number = 1;       # output files will be named like "something.fasta.1"
my $seq_ctr = 0;

open(FASTA, $fasta_file) || die("can't open $fasta_file: $!");
while (<FASTA>) {
    if (/^>/) {
        # open a new file once we've printed enough sequences to this one
        if ($seq_ctr++ % $seqs_per_file == 0) {
            close(OUT) if $seq_ctr > 1;
            open(OUT, ">", $fasta_file . "." . $file_number++)
                || die("can't open output file: $!");
        }
    }
    print OUT $_;
}
close(OUT);
close(FASTA);
This code is in Java. I don't mind if the moderators remove it from here, but it might help. :)
/**
 * This tool chops a FASTA file into several parts based on the number of
 * sequences required in one file.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.lang3.StringUtils;

/**
 * @author Arpit
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);

        // Prefix each header with a marker so we can split on record
        // boundaries (a clever cheat with help from stackoverflow).
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");

        int startIndex = 0;
        int stopIndex = 0;
        for (int j = 0; j < numOfFiles; j++) {
            FileWriter fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"));
            if (j == (numOfFiles - 1)) {
                stopIndex = splitLines.length; // last file takes the remainder
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            for (int i = startIndex; i < stopIndex; i++) {
                fw.write(splitLines[i]);
            }
            startIndex = stopIndex;
            fw.close();
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I know you said you wanted this in Perl, but I have done this many times with Python and BioPython, which I believe is comparable to BioPerl (but better :).
import sys
from Bio import SeqIO

def write_file(input_file, split_number):
    # base name of the fasta file, used to build the output file names
    parent_file_base_name = input_file.split(".")[0]
    counter = 1
    # our first output file name
    file = parent_file_base_name + "_" + str(counter) + ".fasta"
    # holds the records to be written to the current file
    joiner = []
    # enumerate the huge fasta
    for num, record in enumerate(SeqIO.parse(input_file, "fasta"), start=1):
        # append the record to our list holder
        joiner.append(">" + record.id + "\n" + str(record.seq))
        # once we have reached the maximum number of records for one file,
        # write them out and clear the record holder
        if num % split_number == 0:
            joiner.append("")
            with open(file, 'w') as f:
                f.write("\n".join(joiner))
            # change the file name, clear the record holder, bump the file count
            counter += 1
            file = parent_file_base_name + "_" + str(counter) + ".fasta"
            joiner = []
    # write out whatever records are left over
    if joiner:
        joiner.append("")
        with open(file, 'w') as f:
            f.write("\n".join(joiner))

if __name__ == "__main__":
    input_file = sys.argv[1]
    split_number = int(sys.argv[2])
    write_file(input_file, split_number)
    print("fasta_splitter.py is finished.")
Just run it with
python script.py parent_fasta.fasta <how many records per file>