python - Applying a formula to data records that span multiple lines

Question

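I am just taking my first steps in learning a bit of Python. I am currently working through Rosalind, an online course that teaches Python skills for bioinformatics. (It is very good, by the way; see rosalind.info.)

I am struggling with one particular problem. I have a file in FASTA format that looks like this: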

>Sequence_Header_1
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT
>Sequence_Header_2
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT

I need to calculate the percentage of G and C in each entry of the file (excluding the headers) and return that number, e.g.:

>Sequence_Header_1
48.75%
>Sequence_header_2
52.43%

My code so far is:

file = open("input.txt" , "r")
for line in file:
    if line.startswith(">"):
        print(line.rstrip())        
    else:
        print ('%3.2f' % (line.count('G')+line.count('C')/len(line)*100))
file.close()

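This does almost everything I need. I only run into trouble when the sequence data spans multiple lines. At the moment I get the % GC content for every line in the file rather than a single number per entry, e.g.: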

>Sequence_Header_1
48.75%
52.65%
>Sequence_header_2
52.43%
50.25%

How can I apply my formula to data that spans multiple lines?

Thanks in advance,

score 1 · Accepted Answer

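Not really a direct answer to your question, but I think this is a better approach! If you plan to do more bioinformatics in Python, take a look at Biopython. It will handle FASTA files and other common sequence manipulations for you (and much more!).

For example: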

from Bio import SeqIO
# Note: newer Biopython releases replace GC with gc_fraction (returns a 0-1 fraction)
from Bio.SeqUtils import GC

for rec in SeqIO.parse("input.txt", "fasta"):
    print(rec.id, GC(rec.seq))

score 0 · Accepted Answer

You can parse the FASTA format and build a dictionary with the >ID as the key and the sequence as the value, like this:

    from collections import defaultdict

    def parse_fasta(dataset):
        "Parses data in FASTA format, returning a dictionary of ID's and values"
        records = defaultdict(str)
        record_id = None
        for line in [l.strip() for l in dataset.splitlines()]:
            if line.startswith('>'):
                record_id = line[1:]
            else:
                records[record_id] += line
        return records

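Alternatively, you could rewrite this code slightly to build tuples or a list instead. I prefer the dictionary because it is already indexed. If you still need help, you can find me on the Rosalind site.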
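
To connect the dictionary back to the original question, here is a minimal sketch of how the parsed records could be turned into GC percentages. The helper name `gc_percent` and the sample data are illustrative assumptions, not part of the answer above; `parse_fasta` is repeated (with a blank-line check added) only so the snippet runs on its own.

```python
from collections import defaultdict


def parse_fasta(dataset):
    """Parse FASTA text into a dict mapping record ID to its full sequence."""
    records = defaultdict(str)
    record_id = None
    for line in (l.strip() for l in dataset.splitlines()):
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            record_id = line[1:]
        else:
            # sequence lines of one record are concatenated under its ID
            records[record_id] += line
    return records


def gc_percent(sequence):
    """Hypothetical helper: percentage of G and C characters in a sequence."""
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up sample data, two records of two lines each
sample = ">Sequence_Header_1\nACGT\nGGCC\n>Sequence_Header_2\nAATT\nAATG\n"
for record_id, sequence in parse_fasta(sample).items():
    print(">{}\n{:.2f}%".format(record_id, gc_percent(sequence)))
```

Because the lines of each record are joined before counting, the percentage is computed once per entry rather than once per line.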

score 0 · Accepted Answer

I think this is what you are looking for:

# Read the input file
with open("input.txt", "r") as f:
    s = f.read()

# Loop through the Sequences
for i, b in enumerate(s.split("Sequence_Header_")):
    if not b: continue # At least the first one is empty 
                       # because there is no data before the first sequence
    # Join the lines with data
    data = "".join(b.split("\n")[1:])

    # Print the result
    print("Sequence_Header_{i}\n{p:.2f}%".format(
        i=i, p=(data.count('G')+data.count('C'))/len(data)*100))

Note: I could not find a ">" sign in your example. If your headers start with >, you can rewrite the code to use s.split(">") and it should still work.
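
As a rough sketch of the s.split(">") variant mentioned in the note, the version below reads the header text from each chunk instead of rebuilding it from a counter. The function name `gc_by_record` and the sample text are assumptions made for illustration, and the whole input is still held in memory.

```python
def gc_by_record(text):
    """Return a list of (header, GC percent) pairs from FASTA text."""
    results = []
    for block in text.split(">"):
        if not block.strip():
            continue  # the chunk before the first '>' is empty
        lines = block.splitlines()
        header = lines[0].strip()
        # join every sequence line of this record into one string
        data = "".join(line.strip() for line in lines[1:])
        if not data:
            continue  # header without sequence data: nothing to divide by
        pct = 100.0 * (data.count("G") + data.count("C")) / len(data)
        results.append((header, pct))
    return results


# Made-up sample data, two records of two lines each
sample = ">Sequence_Header_1\nACGT\nGGCC\n>Sequence_Header_2\nAATT\nAATG\n"
for header, pct in gc_by_record(sample):
    print(">{}\n{:.2f}%".format(header, pct))
```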

score 0 · Accepted Answer

Try keeping a running count, and reset it whenever you find a new header.

count = 0.0
line_length = 0.0
header = None
for line in open("input.txt", "r"):  # iterates line by line, saves memory
    if line.startswith('>'):
        if header is not None and line_length > 0.0:
            # a new record starts: report the previous one, then reset
            print(header + '\n' + ('%3.2f' % (100.0 * count / line_length)))
            count = 0.0
            line_length = 0.0
        header = line.strip()
    else:
        count += line.count('G') + line.count('C')
        line_length += len(line.strip())
# report the final record, which has no header after it
if header is not None and line_length > 0.0:
    print(header + '\n' + ('%3.2f' % (100.0 * count / line_length)))

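Also be careful with division in Python: in Python 2, dividing two integers with / is integer division, i.e. 5/2 = 2 (in Python 3, / always returns a float). You can avoid this by using floating-point values in your variables or by calling float().

Edit: improved. It should also be line_length += len(line.strip()), so that the trailing newline "\n" is not counted toward the length.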
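
To illustrate the division caveat, here is a small sketch run under Python 3. The variable names are made up for the example; under Python 2 the first expression would give 2 instead of 2.5.

```python
gc_count = 5
total = 2

print(gc_count / total)         # true division in Python 3: 2.5
print(gc_count // total)        # floor division in Python 2 and 3: 2
print(float(gc_count) / total)  # float division in Python 2 and 3: 2.5
```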

score 0 · Accepted Answer

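It may not be possible to hold the whole file in memory. Assuming you cannot do s = f.read() all at once, you need to keep a running count of the G/C letters and of the total letters until a new sequence starts. Something like this: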

file = open("input.txt", "r")
# for keeping count:
char_count = 0
char_total = 0
for line in file:
    if line.startswith(">"):
        if char_total != 0:
            # starting a new sequence, so calculate pct for last sequence
            char_pct = (float(char_count) / char_total) * 100
            print('%3.2f' % char_pct)
            # reset the count for the next sequence
            char_total = 0
            char_count = 0
        print(line.rstrip())
    else:
        # continuing a sequence, so add to running counts
        char_count += line.count('G') + line.count('C')
        # strip() keeps the trailing newline out of the total
        char_total += len(line.strip())
# the last sequence has no header after it, so report it after the loop
if char_total != 0:
    print('%3.2f' % ((float(char_count) / char_total) * 100))
file.close()
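
The same running-count idea can also be wrapped in a generator, so that the counting logic is separate from the printing and can be tried without a file on disk. This is only a sketch: the name `iter_gc` and the in-memory sample are assumptions, and `io.StringIO` stands in for an open file.

```python
import io


def iter_gc(lines):
    """Yield (header, GC percent) per record, reading one line at a time."""
    header = None
    gc_count = 0
    total = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # a new header closes the previous record, if it had any data
            if header is not None and total:
                yield header, 100.0 * gc_count / total
            header, gc_count, total = line, 0, 0
        elif line:
            gc_count += line.count("G") + line.count("C")
            total += len(line)
    # the last record has no following header, so flush it here
    if header is not None and total:
        yield header, 100.0 * gc_count / total


# io.StringIO stands in for an open file; any iterable of lines works
sample = io.StringIO(">Sequence_Header_1\nACGT\nGGCC\n>Sequence_Header_2\nAATT\nAATG\n")
for header, pct in iter_gc(sample):
    print("{}\n{:.2f}%".format(header, pct))
```

Only the counters for the current record are kept in memory, so the file size does not matter.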
