ruby - How do I extract all possible methionine residues to the end from a protein sequence?

Question

For every methionine (M) residue in a protein sequence, I want to extract the subsequence that runs from that residue to the end of the sequence.

Given the sequence below:

MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG

Original nucleotide sequence:

atgtttgaaatcgaagaacatatgaaggattcacaggtggaatacataattggccttcataatatcccattattgaatgcaactatttcagtgaagtgcacaggatttcaaagaactatgaatatgcaaggttgtgctaataaatttatgcaaagacattatgagaatcccctgacgggg

I want to extract from the sequence any M residue to the end, and obtain the following:

- MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
- MNMQGCANKFMQRHYENPLTG
- MQGCANKFMQRHYENPLTG
- MQRHYENPLTG

With the data I am working with there are cases where there are a lot more "M" residues in the sequence.

The script I currently have is below. This script translates the genomic data first and then works with the amino acid sequences. This does the first two extractions but nothing further.

I have tried to repeat the same scan method after the second scan (See the commented part in the script below) but this just gives me an error:

private method `scan' called for #<Array:0x7f80884c84b0> (NoMethodError)

I understand that I need some kind of loop and have tried to write one, but in vain. I have also tried matching but haven't been able to make it work. I think you cannot match overlapping characters with a single match method, but then again I'm only a beginner...

So here is the script I'm using:

#!/usr/bin/env ruby

require "bio" 

def extract_open_reading_frames(input)

  file_output = File.new("./output.aa", "w")
  input.each_entry do |entry|
    i = 1
    entry.naseq.translate(1).scan(/M\w*/i) do |orf1|
      file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf1}"
      i = i + 1 
      orf1.scan(/.(M\w*)/i) do |orf2|
        file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf2}"
        i = i + 1
        #   orf2.scan(/.(M\w*)/i) do |orf3|
        #     file_output.puts ">#{entry.definition.to_s} 5\'3\' frame 1:#{i}\n#{orf3}"
        #     i = i + 1
        #   end
      end
    end 
  end
  file_output.close
end


biofastafile = Bio::FlatFile.new(Bio::FastaFormat, ARGF)

extract_open_reading_frames(biofastafile)

The script has to be in Ruby since this is part of a much longer script that is in Ruby.

score 3 · Accepted Answer

You can do:

str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"
str.scan(/(?=(M.*))./).flatten
#=> ["MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG", "MNMQGCANKFMQRHYENPLTG", "MQGCANKFMQRHYENPLTG", "MQRHYENPLTG"]

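This works by capturing, inside a lookahead, everything from an M to the end of the string, and then consuming a single character so that the scan advances one position at a time.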
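As a minimal sketch of how the lookahead scan might slot into the numbered output format from the question, it can be wrapped in a small helper. The method name `met_suffixes` and the header label `example` are hypothetical and not part of BioRuby. Note also that this pattern uses `.*`, whereas the question's pattern uses `\w*`, which stops at the first non-word character.

```ruby
# Hypothetical helper: returns every suffix of +protein+ that starts with "M".
def met_suffixes(protein)
  # (?=(M.*)) captures the suffix without consuming it; the trailing "."
  # consumes one character, so the next attempt starts one position later.
  protein.scan(/(?=(M.*))./).flatten
end

seq = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"

# Print each suffix under a numbered FASTA-style header.
met_suffixes(seq).each_with_index do |orf, i|
  puts ">example 5'3' frame 1:#{i + 1}"
  puts orf
end
```

Because the helper returns plain strings rather than nested arrays, each result can be passed straight to `puts` or to further string methods.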

score 1 · Accepted Answer

str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"

pos = 0

while pos < str.size
  if md = str.match(/M.*/, pos)
    puts md[0]
    pos = md.offset(0)[0] + 1
  else
    break
  end
end

--output:--
MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG
MNMQGCANKFMQRHYENPLTG
MQGCANKFMQRHYENPLTG
MQRHYENPLTG

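- md stands for a MatchData object.
- match() returns nil if there is no match; its second argument is the position at which the search starts.
- md[0] is the entire match (md[1] would be the first parenthesized group, and so on).
- md.offset(n) returns an array containing the start and end positions of md[n] within the string.

Running the program on the string "MMMM" produces the output: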
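The MatchData calls described here can be tried on a short string. This is only an illustrative sketch: the sample string and variable names are made up, and the two-argument form of `match` is assumed to be available (Ruby 1.9 or later).

```ruby
str = "xxMabcMde"

# Start the search at index 3, so the "M" at index 2 is skipped.
md = str.match(/M(.)/, 3)

whole = md[0]          # the entire match
group = md[1]          # the first parenthesized group
span0 = md.offset(0)   # start and end positions of md[0] in str
span1 = md.offset(1)   # start and end positions of md[1] in str
none  = str.match(/Z/) # no match, so this is nil

puts whole         # Md
puts group         # d
puts span0.inspect # [6, 8]
puts span1.inspect # [7, 8]
puts none.inspect  # nil
```

The end position reported by `offset` is the index just past the last matched character, which is why adding 1 to the start position (rather than using the end position) lets the loop find the next, overlapping match.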

MMMM
MMM
MM
M

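"I have also tried matching but haven't been able to make it work. I think you cannot match overlapping characters with a single match method, but then again I'm only a beginner..."

Yes, that is true. String#scan does not find overlapping matches: after finding a match, scan continues searching from the end of that match. Perl has ways to back up within a regex; I don't know whether Ruby has them.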
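The non-overlapping behaviour of scan is easy to confirm on a short string. The sketch below also contrasts it with a zero-width lookahead, which Ruby's regex engine does support; the variable names are illustrative only.

```ruby
str = "MMMM"

# A plain scan consumes each match, so the pairs cannot overlap.
plain = str.scan(/MM/)

# A greedy pattern swallows the rest of the string in a single match.
greedy = str.scan(/M.*/)

# The lookahead captures "MM" without consuming it; "." then consumes one
# character, so every starting position is tried.
overlapping = str.scan(/(?=(MM))./).flatten

puts plain.inspect       # ["MM", "MM"]
puts greedy.inspect      # ["MMMM"]
puts overlapping.inspect # ["MM", "MM", "MM"]
```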

Edit:

For Ruby 1.8.7:

str = "MFEIEEHMKDSQVEYIIGLHNIPLLNATISVKCTGFQRTMNMQGCANKFMQRHYENPLTG"

pos = 0

while true
  str = str[pos..-1]

  if md = str.match(/M.*/)
    puts md[0]
    pos = md.offset(0)[0] + 1
  else
    break
  end
end
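For comparison, the same result can be reached without a regex and without reassigning str, by stepping through the string with String#index and a start offset. This is a swapped-in alternative rather than the approach above, and the method name `met_suffixes_by_index` is hypothetical.

```ruby
# Hypothetical helper: collects every suffix of +str+ that starts with "M".
def met_suffixes_by_index(str)
  results = []
  pos = str.index("M")              # position of the first "M", or nil
  while pos
    results << str[pos..-1]         # suffix from this "M" to the end
    pos = str.index("M", pos + 1)   # next "M" after the current one, or nil
  end
  results
end

puts met_suffixes_by_index("MMMM")
```

Unlike the regex versions, this sketch is case-sensitive and does not stop at newlines, so it behaves differently on input that contains either.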
