我是 perl 初学者,请帮助我完成查询...我正在尝试从爆炸表中提取信息(如下所示):这是标准的爆炸表输入...我基本上想提取读取列表中的任何信息(看看我下面的第二个脚本,以了解我想要做什么)......无论如何,这正是我在第二个脚本中所做的:
输入:
1) 爆破台:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2)阅读列表:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3)我想要的输出:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
我想要第一个列表中的信息子集,给定我要提取的读取名称列表(在第 4 列中找到)而不是散列读取列表(仅?)我想散列爆炸表本身,并使用(爆炸表的)第 4 列中的信息作为键来提取每个键的值,即使该键可能具有多个值(即:每个读取名称实际上可能有多个命中,或关联表中的爆炸结果),请记住,该值包括其中包含该键(读取名称)的整个行。
我的 greplist.pl 脚本可以做到这一点,但是非常慢,我认为,(如果我错了,请纠正我)通过将整个表加载到哈希中,这应该会大大加快速度......
感谢您的帮助。
我的剧本:破碎的剧本 (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
缓慢而快速的作弊(有效)并且非常慢(我的爆炸表可能大约 400MB 到 2GB 大,我相信你可以明白它为什么这么慢)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
我不知道如何将第 4 列定义为键:也许我可以使用 split 函数,但我试图找到一种方法对超过 2 列的表执行此操作无济于事...请帮助!如果有一个简单的方法,请告诉我
谢谢 !