python - How to count amino acids in a FASTA-formatted file?

Question

I found some code that parses FASTA-formatted files. I need to count how many A, T, G, etc. occur in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL

In this sequence:

M - 2
R - 4
G - 1
L - 3
I - 2
P - 1

The code is very simple:

def FASTA(filename):
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, does not exist" % filename)
    return

  order = []      # header names, in file order
  sequences = {}  # header name -> full sequence string

  for line in f:
    if line.startswith('>'):
      # header line: start a new record
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      # sequence line: append it, dropping a trailing '*' terminator
      sequences[name] += line.rstrip('\n').rstrip('*')

  f.close()
  print("%d sequences found" % len(order))
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

But how can I count those amino acids? I don't want to use BioPython; I'd like to know how to do it with, for example, count...

score 5 · Accepted Answer

As katrielalex pointed out, collections.Counter is a good fit for this task:

In [1]: from collections import Counter

In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})

You can apply this to the values of the sequences dict in your code.
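
As a minimal sketch of that step: the `sequences` dict below is a hypothetical stand-in for the one returned by `FASTA()`, and the names `seq1` and `seq2` are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the dict returned by FASTA();
# in the real dict the keys are the FASTA header lines.
sequences = {
    "seq1": "MRMRGRRLLPIIL",
    "seq2": "MKKL",
}

# Build one Counter per sequence, keyed by the same names.
counts = {name: Counter(seq) for name, seq in sequences.items()}

for name, counter in counts.items():
    print(name)
    # most_common() lists residues from most to least frequent
    for residue, n in counter.most_common():
        print("  %s - %d" % (residue, n))
```

A Counter returns 0 for residues that never occur, so `counts["seq1"]["A"]` is safe to look up.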

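However, I would advise against using this code in real life. Libraries such as BioPython do a better job. For example, the code you showed builds a fairly large data structure, because it keeps every sequence in memory at once. FASTA files can be very large, so you may run out of memory. It also does not handle possible exceptions in the best way.

Finally, using a library saves you time.

BioPython example code: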

In [1]: from Bio import SeqIO

In [2]: from collections import Counter

In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
   ...:     print(Counter(entry.seq))
   ...:
Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})
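
On the memory concern raised above: if BioPython is not an option, a rough sketch of the same record-at-a-time idea in plain Python could look like the following. The function name `iter_fasta_counts` and the in-memory demo data are assumptions made for illustration; this is not BioPython's implementation.

```python
import io
from collections import Counter

def iter_fasta_counts(handle):
    """Yield (header, Counter) one record at a time from an open FASTA handle."""
    name, counter = None, None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, counter  # previous record is complete
            name, counter = line[1:], Counter()
        elif name is not None:
            # tally this line's residues, ignoring a trailing '*' terminator
            counter.update(line.rstrip("*"))
    if name is not None:
        yield name, counter  # last record in the file

# Demo on an in-memory file; for a real file, pass a handle from open(path).
demo = io.StringIO(">seq1\nMRMRGRR\nLLPIIL*\n>seq2\nMKKL\n")
results = list(iter_fasta_counts(demo))
for name, counter in results:
    print(name, dict(counter))
```

Only the current record's Counter is held while iterating, so memory use does not grow with the number of sequences (the demo collects the results in a list only so they can be printed).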

score 4 · Accepted Answer

This can be done with a very simple bash command; my answer is below.

cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL

cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c

Output:

      1 G
      2 I
      3 L
      2 M
      1 P
      4 R

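fold -w1 splits the input at every character; you then sort the characters and count each unique one. Note that this counts across the whole file, not per sequence.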

score 2 · Accepted Answer

An alternative to the approach katrielalex mentioned in the comments is to use another dictionary. The code is as follows:

def FASTA(filename):
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, does not exist" % filename)
    return

  order = []
  sequences = {}
  counts = {}  # residue -> total count across all sequences in the file

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      chunk = line.rstrip('\n').rstrip('*')
      sequences[name] += chunk
      # count only the newly read chunk, so that sequences spanning
      # several lines are not counted more than once
      for aa in chunk:
        if aa in counts:
          counts[aa] = counts[aa] + 1
        else:
          counts[aa] = 1

  f.close()
  print("%d sequences found" % len(order))
  print(counts)
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

This outputs:

1 sequences found
{'M': 2, 'R': 4, 'G': 1, 'L': 3, 'P': 1, 'I': 2}
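
The function above keeps a single tally for the whole file. If one tally per sequence is wanted, as in the question, the same plain-dictionary idea can be nested. This is only a sketch: `count_per_sequence` and the example data are hypothetical names, not part of the answer's code.

```python
def count_per_sequence(sequences):
    """Return {name: {residue: count}} using only plain dicts."""
    per_sequence = {}
    for name, seq in sequences.items():
        counts = {}
        for aa in seq:
            # dict.get supplies 0 the first time a residue is seen
            counts[aa] = counts.get(aa, 0) + 1
        per_sequence[name] = counts
    return per_sequence

# Hypothetical stand-in for the sequences dict returned by FASTA().
example = {"seq1": "MRMRGRRLLPIIL", "seq2": "MKKL"}
result = count_per_sequence(example)
print(result["seq1"])
```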

score 0 · Accepted Answer

# your previous code here

x, y = FASTA("drosoph_b.fasta")

import collections

for order in x:
  # print the header, then one "residue - count" line per amino acid
  print(order, ':', end=' ')
  print('\n'.join('%s - %d' % (k, v) for k, v in collections.Counter(y[order]).items()))
  print()
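
Since the question asks about count specifically, here is a small sketch using only str.count. The constant `AMINO_ACIDS` and the helper `count_with_str_count` are names invented for this example, and the alphabet is assumed to be the 20 standard one-letter codes.

```python
# Assumption: only the 20 standard one-letter amino acid codes are of interest.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_with_str_count(seq):
    """Count each standard residue with str.count, skipping absent ones."""
    return {aa: seq.count(aa) for aa in AMINO_ACIDS if aa in seq}

counts = count_with_str_count("MRMRGRRLLPIIL")
for aa, n in counts.items():
    print("%s - %d" % (aa, n))
```

This scans the sequence once per letter of the alphabet and silently ignores non-standard codes such as X, so a Counter is usually the simpler choice.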
