1

我正在从事一个生物信息学项目,该项目涉及将不同的脚本和输入参数连接在一起,以分析下一代测序 Illumina 数据。我需要帮助调试第一个脚本。它的任务是解析 qseq 文件,将“好”样本转换为 fastq 格式,并将输出保存到临时 txt 文件(到磁盘)。

出于调试目的,管道方案如下所示:

# the input parameters are "fed" into the script and the output is written
# to tmp.txt
script01.pl [input parameters] > tmp.txt

我在终端中键入此命令,然后查看 tmp.txt 文件以检查脚本是否输出预期结果。但是对于整个项目,我有一个称为包装脚本的东西,它将所有脚本连接在一起。Script01 会将 end1 和 end2 数据的输出保存到 tmp 文件中,因为它们需要分别由脚本 02-06 处理。

这是代码。我添加了描述性评论,以便您了解发生了什么。另外,我没有 qseq 文件向您展示,只是假设我正在正确解析字段:

#!/usr/bin/perl
use strict; use warnings;

my $end1_temp=shift; # this variable is the location of the temporary file
my $end2_temp=shift; # this variable is the location of the temporary file
my $qseq_file=shift;
my $barcode=shift;

# declare and initialize variables
my $overhang= 'C[AT]GC';
my $end1_trim_offset= length($barcode)+4;
my $end2_trim_offset= 4; 

my %to_keep=(); # an empty hash
my @line=();    # an empty array

# open the qseq file, or exit
open (QSEQ_1_FILE, $qseq_file) or die "couldn't open $qseq_file for read\n";

# also, open (for output) the end1 temp file, so we can write to it while
# we process the end1 input file above
(open END_1_FILEOUT,">$end1_temp") or die "couldn't open $end1_temp for write\n";

# reads each line of the qseq data file one at a time. 
# assume each sample is kept on a separate line
while(<QSEQ_1_FILE>){
chomp; @line=split; # index and formats the qseq fields for parsing

# skip samples those that didn't pass QC
$line[10]>0 or next;

# process the samples here. look at only the samples that pass the quality 
# control and whose sequences contain the barcode+overhang at the beginning
# of the string, otherwise skip to the next sample in the data file 
# (i.e start the next iteration of the loop)

# trim the barcode+overhang from seq and qual
$line[8]= substr($line[8], $end1_trim_offset); #sequence
$line[9]= substr($line[9], $end1_trim_offset); #quality

# unique sequence identifier. don't worry about what $line[1-6] represent
# just know that $identifier is unique for each of the 'good' samples
my $identifier = $line[0].'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# store the identifiers of the 'good' samples in a hash.
# the hash should contain the identifier as the key and numbers (1,2,3,etc.)
# as the values. the following increments the hash values for each identifier.
$to_keep{$identifier}++;

# the following below is suppose to write information to the filehandle in fastq format:
# @[the identifier]/1
# sequence [ATGCAGTAAT...]
# +[the identifier]/1
# quality [ASCII characters]

print END_1_FILEOUT '@' . "$identifier/1\n$line[8]\n" . '+' . identifier/1\n$line[9]\n";

}
print "Found " . int(keys %to_keep) . " reads from end1\n";
# close the filehandles
close QSEQ_FILE; close END_1_FILEOUT;

此时,我有一个仅包含“好”序列标识符的哈希,并将 fastq 数据写入存储位置。Script01 假设将 fastq 输出保存到磁盘上的临时文件中。

$end1_temp = '~/tmp/sampleD1_end1.fq'end2_temp = '~/tmp/sampleD1_end2.fq'

问题:print END_1_FILE上面的行是将 fastq 数据写入文件句柄还是写入 $end1_temp 变量?我问是因为我需要将 $end1_temp 和 $end2_temp 变量传递给 script02。另外,为了调试,我如何查看 script01 的 fastq 输出?

这是我需要帮助的其余代码。它在同一个脚本上,直接遵循上面的代码:

# if the sequence is paired (has both end1 and end2 data), then the qseq_2 file exists and the conditions evaluates to true
if ($end2_temp) {
# changes the qseq filename from file name from end1 to end2 data
# don't worry about why it works
$qseq_file=~ s/'_1_'/'_2_'/;

open (QSEQ_2_FILE, $qseq_file) or die "could not open $qseq_file for read\n";

# open (for output) the end2 temp file, so we can write to it while we process
# the end2 input file above
open (END_2_FILEOUT, ">$end2_temp") or die "could not open $end2_temp for write\n";

# reads each line of the end2 file one at a time
while(<QSEQ_2_FILE>){
# skip comments
/^\#/ and next;

chomp; #keep the qseq fields on one line.
my @line= split; #indexes the qseq fields.

# unique sequence identifier that preserves the sequencing information
# in other words, samples from end1 and end2 will have the same unique
# identifier because they contain the exact same fields in columns 0-6 
# or $line[0-6]. also end1 and end2 have the same number of samples
my $identifier = $line[0] .'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# recall the %to_keep hash which stored the identifiers of the 'good' samples; 
# it was inside the QSEQ_1_FILE loop
# here I only want the end2 samples whose identifiers match the identifiers
# from the end1 samples
# skip sample where end1 didn't pass; does this work??
# the condition is suppose to evaluate to false if the identifiers don't match
$to_keep{$identifier} or next;

# trim the barcode+overhang from seq and qual
$line[8]=substr($line[8], $end2_trim_offset);  # sequence
$line[9]=substr($line[9], $end2_trim_offset);  # quality

print END_2_FILEOUT '@' . "$identifier/2\n$line[8]\n" . '+' . "$identifier/2\n$line[9]\n";

}

close QSEQ_2_FILE; close END_2_FILEOUT;

}

脚本01到此结束。此时,我应该已经将 end1 和 end2 的“好”样本的 fastq 数据写入单独的存储位置。Script01 假设将 fastq 输出保存到磁盘上的两个临时文件中。我想我的问题是出于调试目的,如何查看 script01 创建的 tmp 文件?

最后,当我script01.pl [input parameters] > tmp.txt在 Linux 终端中输入命令时,它会将 script01 的输出保存到 tmp.txt。“Found X reads from end1”是脚本在处理 end1 读取后打印的内容,其中 X 是 %to_keep 哈希中的读取次数。

当我查看 tmp.txt 时,它显示Found 0 reads from end1.因为它打印 0,这意味着哈希中没有存储任何内容。假设从 end1 存储大约 630 万条读数。有人可以帮我弄清楚为什么没有读取存储在哈希中吗?

我认为问题在于没有一个读取通过了我用来决定它们是否应该存储在哈希中的标准。另一个问题可能与我存储标识符的方式有关。

你们可以看看它,看看有没有我可能错过的东西?

谢谢。非常感谢对我的问题的任何建议或答案。

4

1 回答 1

1

问题:上面的 print END_1_FILE 行是将 fastq 数据写入文件句柄还是写入 $end1_temp 变量?

我假设你的意思是END_1_FILEOUT。这是一个文件的可写文件句柄,其名称存储在$end1_temp. 数据将被发送到文件中。

另外,为了调试,我如何查看 script01 的 fastq 输出?

你查看文件。根据您的问题,我认为该文件称为~/tmp/sampleD1_end1.fq. 你应该进去看看。

Script01 假设将 fastq 输出保存到磁盘上的两个临时文件中。我想我的问题是出于调试目的,如何查看 script01 创建的 tmp 文件?

和以前一样,输出应该在名称为$end1_tempand的文件中$end2_temp,大概是~/tmp/sampleD1_end1.fqand ~/tmp/sampleD1_end2.fq。用您的编辑器打开它们并查看。

最后,当我在 Linux 终端中输入命令 script01.pl [输入参数] > tmp.txt 时,它将 script01 的输出保存到 tmp.txt。当我查看 tmp.txt 时,它显示“Found 0 reads from end1”。由于它打印 0,这意味着哈希中没有存储任何内容。假设从 end1 存储大约 630 万条读数。有人可以帮我弄清楚为什么没有读取存储在哈希中吗?

这是基本调试。从您的代码中可以清楚地看出,您正在读取的 QSeq 文件是空的,或者您的测试$line[10] > 0 or next正在丢弃每条记录。print通过在 之后直接添加诊断语句split来显示 的值,这两种方法都很容易检查$line[10]。我敢打赌,您正在寻找错误的领域。

除此之外,你真的应该缩进你的代码。这将有助于我们比广泛的评论更好地理解它。还,

$line[10] > 0 or next

最好写成

next unless $line[10] > 0

open你应该使用词法文件句柄的三参数形式。

这是您的代码的整理版本,其中包含这些改进以及其他一些改进

use strict;
use warnings;

die qq(Insufficient arguments "@ARGV") unless @ARGV >= 4;

my ($end1_temp, $end2_temp, $qseq1_file, $barcode) = @ARGV;

my $overhang         = 'C[AT]GC';
my $end1_trim_offset = length($barcode) + 4;
my $end2_trim_offset = 4;

my $idformat = '%s-%s-%s_%s-%s-%s_#%s';

my %to_keep;

open my $qsec1, '<', $qseq1_file
    or die "couldn't open $qseq1_file for read\n";

open my $end1, '>', $end1_temp
    or die "couldn't open $end1_temp for write\n";

while (<$qsec1>) {

  my @line = split;
  next unless $line[10] > 0;

  $_ = substr($_, $end1_trim_offset) for @line[8,9];

  my $identifier = sprintf $idformat, @line;
  $to_keep{$identifier}++;

  print $end1
      '@' . "$identifier/1\n" .
      "$line[8]\n" .
      '+' . "$identifier/1\n" .
      "$line[9]\n";
}

close $qsec1;
close $end1;

printf "Found %d reads from end1\n", int keys %to_keep;


exit unless $end2_temp;

$qseq1_file =~ s/'_1_'/'_2_'/;

open my $qseq2, '<', $qseq1_file
    or die "could not open $qseq1_file for read\n";

open my $end2, '>', $end2_temp
    or die "could not open $end2_temp for write\n";

while (<$qseq2>) {

  next if /^#/;
  my @line = split;

  my $identifier = sprintf $idformat, @line;
  next unless $to_keep{$identifier};

  $_ = substr($_, $end2_trim_offset) for @line[8,9];

  print $end2
      '@' . "$identifier/2\n" .
      "$line[8]\n" .
      '+' . "$identifier/2\n" .
      "$line[9]\n";
}

close $qseq2;
close $end2;
于 2012-07-14T15:20:39.880 回答