perl - Perl program to mimic RNA synthesis

Question

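Looking for suggestions on how to approach my Perl programming homework assignment to write an RNA synthesis program. I have summarized and outlined the program below. Specifically, I am looking for feedback on the blocks below (I'll number them for easy reference). I have read up to chapter 6 of Elements of Programming with Perl by Andrew Johnson (great book). I have also read the perlfunc and perlop pod-pages, with nothing jumping out on where to start.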

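Program description: The program should read an input file from the command line, translate it into RNA, and then transcribe the RNA into a sequence of uppercase one-letter amino acid names.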

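1. Accept a file named on the command line

here I will use the <> operator

2. Check to make sure the file only contains acgt or die

if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }

3. Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C)

not sure how to do this

4. Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"

not sure but I'm thinking this is where I will start a %hash variables?

5. Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)

Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)

6. If a gap is encountered a new line is started and process is repeated

not sure but we can assume that gaps are multiples of threes.

7. Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?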

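Notes

Must be a self-contained program (stored values for codon names and symbols).

Whenever the program reads a codon that has no symbol, this is a gap in the RNA: it should start a new line of output and begin at the next occurrence of "AUG". For simplicity we can assume that gaps are always multiples of three.

Before I spend any additional hours on research, I am hoping to get confirmation that I'm taking the right approach. Thanks for taking the time to read and for sharing your expertise!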

score 5 · Accepted Answer

1. here I will use the <> operator

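OK, so your plan is to read the file line by line. Don't forget to chomp each line as you go, or you'll end up with newline characters in your sequence.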

2. Check to make sure the file only contains acgt or die

if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }

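In a while loop, the <> operator puts the line it read into the special variable $_, unless you assign it explicitly (my $line = <>).

In the code above, you're reading one line from the file and then throwing it away. You need to save that line.

Also, the ne operator compares two strings, not a string and a regular expression. You'll need the !~ operator here (or the =~ operator with a negated character class, [^acgt]). If you need the test to be case-insensitive, look into the i flag for regular expression matching.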
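As a rough illustration of the points above, the following sketch saves each line, chomps it, and rejects anything that is not a nucleotide. The helper names is_valid_dna and read_dna are made up for this example, and treating an empty line as valid is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $line holds only a, c, g or t (either case).
# An empty line counts as valid here; that is an assumption of this sketch.
sub is_valid_dna {
    my ($line) = @_;
    return $line !~ /[^acgt]/i;    # no character outside the allowed set
}

# Hypothetical helper: reads every line from a filehandle, chomps it,
# and dies on the first line that contains anything but nucleotides.
sub read_dna {
    my ($fh) = @_;
    my $dna = '';
    while ( my $line = <$fh> ) {   # the line is saved, not thrown away
        chomp $line;
        is_valid_dna($line)
            or die "usage: file must only contain nucleotides\n";
        $dna .= $line;
    }
    return $dna;
}

# Demonstration on an in-memory file.
my $text = "acgt\nTTGA\n";
open my $fh, '<', \$text or die $!;
print read_dna($fh), "\n";
```

In a real program, passing \*ARGV instead of the in-memory handle should read the files named on the command line, which is what the bare <> operator does.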

3. Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C).

As GWW said, check your biology. T->U is the only step in transcription. You'll find the tr (transliterate) operator helpful here.
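A minimal sketch of that single step with tr follows. The sub name dna_to_rna is hypothetical, and handling lowercase t as well is an extra assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the RNA version of a DNA string by
# replacing every T with U (and t with u). The argument is not modified,
# because tr is applied to a copy.
sub dna_to_rna {
    my ($dna) = @_;
    ( my $rna = $dna ) =~ tr/Tt/Uu/;
    return $rna;
}

print dna_to_rna("ATGTTTTAA"), "\n";    # prints AUGUUUUAA
```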

4. Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"

not sure but I'm thinking this is where I will start a %hash variables?

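I'd use a buffer here. Define a scalar outside the while(<>) loop. Use index to look for "AUG". If you don't find it, keep the last two bases in that scalar (you can use substr $line, -2, 2 for that). On the next iteration of the loop, append the new line (with .=) to those two bases, and then test for "AUG" again. If you get a hit, you'll know where it is, so you can mark the spot and start translating.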
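The buffering idea can be sketched roughly as below. The sub name find_start and the in-memory test data are invented for the example; it assumes the lines have already been transcribed to RNA.

```perl
use strict;
use warnings;

# Hypothetical helper: scans lines from a filehandle for the first "AUG",
# even when the codon is split across two lines. Returns what has been
# read so far, starting at that "AUG", or undef if no start is found.
sub find_start {
    my ($fh) = @_;
    my $buffer = '';                          # lives outside the loop
    while ( my $line = <$fh> ) {
        chomp $line;
        $buffer .= $line;                     # append to carried-over bases
        my $pos = index( $buffer, 'AUG' );
        return substr( $buffer, $pos ) if $pos >= 0;
        # No hit: keep only the last two bases for the next iteration,
        # since they could be the beginning of an "AUG".
        $buffer = substr( $buffer, -2 ) if length($buffer) > 2;
    }
    return;
}

# "AUG" straddles the line break in this sample.
my $text = "CCCA\nUGUUU\n";
open my $fh, '<', \$text or die $!;
print find_start($fh), "\n";                  # prints AUGUUU
```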

5. Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)

Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)

Again, as GWW said, build a hash table:

my %codons = ( AUG => 'M', ... );

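Then you can use (for example) split to build an array from the line you're currently examining, assemble codons three elements at a time, and fetch the correct amino-acid code from the hash table.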
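One possible way to put split and the hash together is sketched here. The table is deliberately partial and only there for illustration, the sub name codons_to_amino_acids is hypothetical, and marking unknown codons with '?' is an assumption of the example.

```perl
use strict;
use warnings;

# Deliberately partial table, for illustration only; a real program
# needs an entry for every codon it should recognise.
my %codons = (
    AUG => 'M',
    UUU => 'F',
    GGC => 'G',
    UGG => 'W',
);

# Hypothetical helper: splits an RNA string into single bases, takes them
# three at a time, and looks each codon up. Unknown codons become '?';
# one or two leftover bases at the end are ignored.
sub codons_to_amino_acids {
    my ($rna) = @_;
    my @bases   = split //, $rna;
    my $protein = '';
    while ( @bases >= 3 ) {
        my $codon = join '', splice( @bases, 0, 3 );
        $protein .= exists $codons{$codon} ? $codons{$codon} : '?';
    }
    return $protein;
}

print codons_to_amino_acids("AUGUUUGGCUGG"), "\n";    # prints MFGW
```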

6. If a gap is encountered a new line is started and process is repeated

not sure but we can assume that gaps are multiples of threes.

See above. You can test for a gap with exists $codons{$current_codon}.
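To make the gap rule concrete, here is a small sketch under the assumption that the whole RNA is already in one string. The sub name translate_with_gaps and the three-entry table are invented for the example; any codon missing from the table is treated as a gap.

```perl
use strict;
use warnings;

# Deliberately partial table, for illustration only.
my %codons = ( AUG => 'M', UUU => 'F', GGC => 'G' );

# Hypothetical helper: translates from the first "AUG"; a codon without
# a symbol is a gap, which ends the current output line and restarts
# translation at the next "AUG". Returns one string per output line.
sub translate_with_gaps {
    my ($rna) = @_;
    my @lines;
    my $current = '';
    my $pos = index( $rna, 'AUG' );
    while ( $pos >= 0 && $pos + 3 <= length($rna) ) {
        my $codon = substr( $rna, $pos, 3 );
        if ( exists $codons{$codon} ) {
            $current .= $codons{$codon};
            $pos += 3;
        }
        else {
            # Gap: close the current line, resume at the next "AUG".
            push @lines, $current if length $current;
            $current = '';
            $pos = index( $rna, 'AUG', $pos + 3 );
        }
    }
    push @lines, $current if length $current;
    return @lines;
}

# Prints MF on one line and MG on the next.
print "$_\n" for translate_with_gaps("AUGUUUNNNCCAUGGGC");
```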

7. Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?

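You know, looking at the above, it seems way too complex. I built a few building blocks, the subroutines read_codon and translate: I think they help the logic of the program immensely.

I know this is a homework assignment, but I figure it might help you get a feel for other possible approaches: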

use warnings; use strict;
use feature 'state';


# read_codon relies on the 'state' feature introduced in Perl 5.10.
# Both @buffer and $handle hold state for this function, which lets us
# separate reading codons from processing the file line by line.
# Both are initialized the first time read_codon is called.
# Since $handle is a state variable, the current file handle position
# is never reset. Similarly, @buffer always holds whatever was left
# from the previous call.
# When @buffer contains fewer than 3bp, we need to read a new line,
# remove the "\n" character, split it and push the resulting list
# onto the end of @buffer.
# If we encounter EOF on $handle, then we have exhausted the file
# and cannot complete another codon, so we return undef.
# Otherwise we take the first 3bp of @buffer, join them into a string,
# transcribe it and return it.

sub read_codon {
    my ($file) = @_;

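    state @buffer;
    # Open the file only once: a state variable's initializer runs on the
    # first call only, so the read position survives between calls.
    state $handle = do {
        open my $fh, '<', $file or die "Cannot open '$file': $!";
        $fh;
    };

    # Refill until a whole codon is buffered (a line may be shorter than 3bp).
    while (@buffer < 3) {
        defined( my $new_line = <$handle> ) or return;
        chomp $new_line;
        push @buffer, split //, $new_line;
    }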

    return transcribe(
                       join '', 
                       shift @buffer,
                       shift @buffer,
                       shift @buffer
                     );
}

sub transcribe {
    my ($codon) = @_;
    $codon =~ tr/T/U/;
    return $codon;
}


# translate also relies on the 'state' feature introduced in Perl 5.10.
# The $TRANSLATE state is initialized to 0.
# As codons are passed to it,
# the sub updates the state according to start and stop codons.
# Since $TRANSLATE is a state variable, it is only initialized once
# (the first time the sub is called).
# If the current state is 'translating',
# then the sub returns the appropriate amino acid from the %codes table, if any.
# This gives the caller a logical way to determine whether
# it should print an amino acid or not: if not, the sub returns undef.
# %codes could also be a state variable, but since it is not actually a 'state',
# it is initialized once, in a code block visible from the sub
# but separate from the rest of the program, since it is 'private' to the sub.

{
    my %codes = (    # lexical, so it really is private to translate()
        AUG => 'M',
        ...
    );

    sub translate {
        my ($codon) = @_ or return;

        state $TRANSLATE = 0;

        $TRANSLATE = 1 if $codon =~ m/AUG/i;
        $TRANSLATE = 0 if $codon =~ m/U(AA|GA|AG)/i;

        return $codes{$codon} if $TRANSLATE;
        return;    # not translating: return undef explicitly
    }
}

score 3 · Accepted Answer

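I can give you a few hints on some of your points.

I think your first goal should be to parse the file character by character, making sure each one is valid, group them into sets of three nucleotides, and then work on your other goals.

I think your biology is a bit off as well: when you transcribe DNA to RNA, you need to think about which strands are involved. You may not need to "complement" your bases during the transcription step.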

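2. You should check this as you parse the file character by character.

3. You can do this with a loop and some if statements, or with a hash.

4. This would probably be done with a counter as you read the file character by character, since you need to insert a space after every 3 characters.

5. This would be a good place to use a hash based on the amino acid codon table.

6. You will have to look for gap characters as you parse the file. This seems to contradict your #2 requirement, since the program says your text can only contain ATGC.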
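The character-by-character approach outlined in these points might look something like the sketch below. The sub name group_into_codons is hypothetical; dying on an invalid character and dropping an incomplete final group are assumptions of the example.

```perl
use strict;
use warnings;

# Hypothetical helper: walks a DNA string one character at a time,
# validates each character, and collects them into groups of three.
# An incomplete group at the end is dropped.
sub group_into_codons {
    my ($dna) = @_;
    my @codons;
    my $codon = '';
    for my $char ( split //, $dna ) {
        die "invalid character '$char'\n" unless $char =~ /\A[acgt]\z/i;
        $codon .= $char;
        if ( length($codon) == 3 ) {     # the counter: every third character
            push @codons, $codon;
            $codon = '';
        }
    }
    return @codons;
}

# Joining with a space gives the "space after every 3 characters" layout.
print join( ' ', group_into_codons("ATGTTTGG") ), "\n";    # prints ATG TTT
```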

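There are a lot of Perl functions that could make this easier. There are also Perl modules such as BioPerl. But I think using some of these could defeat the purpose of your assignment.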

score 1 · Accepted Answer


Take a look at BioPerl and browse the source modules for pointers on how to go about it.

answered 2010-11-06T05:17:33.927
