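So the new task is to download the file from the website (http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt). I then have to write a subroutine that stores it line by line and searches for the ID and SQ lines. All of this should be saved to a new txt file in the following order: 1. the ID line comes first, 2. the SQ line comes last, 3. everything else goes between ID and SQ, and each entry ends with the slash line ("//"). Here is one example, but the file contains 1000 such entries.
Expected output example: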
ID 001R_FRG3G Reviewed; 256 AA. -> ID First place *****
AC Q6GZX4;
DT 28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT 19-JUL-2004, sequence version 1.
DT 18-APR-2012, entry version 24.
DE RecName: Full=Putative transcription factor 001R;
GN ORFNames=FV3-001R;
OS Frog virus 3 (isolate Goorha) (FV-3).
OC Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Ranavirus.
OX NCBI_TaxID=654924;
OH NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH NCBI_TaxID=30343; Hyla versicolor (chameleon treefrog).
OH NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH NCBI_TaxID=8404; Rana pipiens (Northern leopard frog).
OH NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT "Comparative genomic analyses of frog virus 3, type species of the
RT genus Ranavirus (family Iridoviridae).";
RL Virology 323:70-84(2004).
CC -!- FUNCTION: Transcription activation (Potential).
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; AY548484; AAT09660.1; -; Genomic_DNA.
DR RefSeq; YP_031579.1; NC_005946.1.
DR ProteinModelPortal; Q6GZX4; -.
DR GeneID; 2947773; -.
DR ProtClustDB; CLSP2511514; -.
DR GO; GO:0006355; P:regulation of transcription, DNA-dependent; IEA:UniProtKB-KW.
DR GO; GO:0046782; P:regulation of viral transcription; IEA:InterPro.
DR GO; GO:0006351; P:transcription, DNA-dependent; IEA:UniProtKB-KW.
DR InterPro; IPR007031; Poxvirus_VLTF3.
DR Pfam; PF04947; Pox_VLTF3; 1.
PE 4: Predicted;
KW Activator; Complete proteome; Reference proteome; Transcription;
KW Transcription regulation.
FT CHAIN 1 256 Putative transcription factor 001R.
FT /FTId=PRO_0000410512.
FT COMPBIAS 14 17 Poly-Arg.
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64; -> SQ at LAST and then "//"
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
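The ordering rule in the example above (ID line first, all other annotation lines in the middle, the SQ line with its sequence lines last, then "//") can be sketched for a single entry roughly like this. It is only a sketch under assumptions: `reorder_entry` is a hypothetical helper name, the lines are already chomped, annotation lines are assumed to start with a two-letter code followed by whitespace, and every other line is treated as sequence data belonging to the SQ block.

```perl
use strict;
use warnings;

# Hypothetical helper: reorder the lines of ONE entry.
# Assumption: annotation lines start with a two-letter code plus whitespace;
# anything else (apart from the "//" terminator) is treated as sequence data.
sub reorder_entry {
    my @lines = @_;
    my (@id, @other, @sq, @seq);
    for my $line (@lines) {
        next if $line =~ m{^//};      # terminator is re-added at the end
        next if $line =~ /^\s*$/;     # ignore blank lines
        if    ($line =~ /^ID\s/)       { push @id,    $line }
        elsif ($line =~ /^SQ\s/)       { push @sq,    $line }
        elsif ($line =~ /^[A-Z]{2}\s/) { push @other, $line }
        else                           { push @seq,   $line }
    }
    return (@id, @other, @sq, @seq, '//');
}

# Small scrambled sample built from lines of the example entry.
my @entry = (
    'SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;',
    'AC Q6GZX4;',
    'ID 001R_FRG3G Reviewed; 256 AA.',
    'SFRKIYTDLG WKFTPL',
    '//',
);
print "$_\n" for reorder_entry(@entry);
```

For the whole file, the same helper could be called once per entry by collecting lines until a "//" line is reached.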
I tried this:
use strict;
use warnings;
sub main {
my @file_data=();
my $motif ='';
my $protein_seq='';
my $h= '[VLIM]';
my $s= '[AG]';
my $x= '[ARNDCEQGHILKMFPSTWYV]';
my $regexp = '^(?:ID|SQ)\s'; # line codes to be searched for: ID and SQ
my @locations=();
@file_data= get_file_data("seq.txt");
$protein_seq= extract_sequence(@file_data);
foreach my $line(@file_data){
if ($line=~ /$regexp/){
print "found motif \n\n";
} else {
print "not found \n\n";
}
}
# Record the positions of the motif for output.
@locations= match_positions($regexp,$protein_seq);
if (@locations){
print "Searching for motifs $regexp \n";
print "Catalytic site is at location:\n";
print join(" ", @locations), "\n";
}
else{
print "motif not found \n\n";
}
exit;
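sub get_file_data{
# Read the whole file and return its lines.
my ($filename)=@_;
open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
my @lines=<$fh>;
close($fh);
return @lines;
}
sub extract_sequence{
# Join all lines that are not blank, comments, or FASTA headers.
my (@lines)=@_;
my $sequence='';
foreach my $line(@lines){
if ($line=~ /^\s*$/){
next;
}
elsif ($line=~ /^\s*#/){
next;
}
elsif ($line=~ /^>/){
next;
}
else {
$sequence.=$line;
}
}
$sequence=~ s/\s//g;
return $sequence;
}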
sub match_positions {
my ($regexp, $sequence)=@_;
my @position=();
while ($sequence=~ /$regexp/ig){
push (@position, $-[0]);
}
return @position;
}
}
main();
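For the download step mentioned at the top, one option might be the core module HTTP::Tiny. The following is a minimal sketch, not something verified against this particular server: `fetch_lines` is a hypothetical helper name, and it assumes the file is served as plain text.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: download a URL and return its content as a list of lines.
sub fetch_lines {
    my ($url) = @_;
    my $response = HTTP::Tiny->new->get($url);
    die "Download failed: $response->{status} $response->{reason}\n"
        unless $response->{success};
    # Split on LF or CRLF so that no line terminators remain.
    return split /\r?\n/, $response->{content};
}

# Usage (needs network access, so it is left commented out):
# my @file_data = fetch_lines('http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt');
# print scalar(@file_data), " lines read\n";
```

Returning chomped lines means the result could be passed directly to a line-by-line search without extra cleanup.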