bioinformatics - Multiple sequence alignment: convert multi-line format to single-line format?

Question

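I have a multiple sequence alignment file in which lines from the different sequences are interspersed, as in the output of clustal and other popular multiple sequence alignment tools. It looks like this: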

TGFb3_human_used_for_docking        ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY


TGFb3_human_used_for_docking        LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN              LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF              LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA              LRSADTTHST-

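Each line begins with a sequence identifier, followed by a string of characters (in this case, the amino acid sequence of a protein). Each sequence is split over several lines, so the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence occupies a single line, like this: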

TGFb3_human_used_for_docking        ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-

(In this particular example the sequences are almost identical, but in general they are not!)

How can I convert from the multi-line multiple sequence alignment format to the single-line format?

score 0 · Accepted Answer

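It looks like you will need to write some kind of script to do this. Here is a simple example I wrote in Python. It won't line up the whitespace as neatly as in your example (if you care about that, you'll have to fiddle with the formatting), but it does the rest of the job: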

# Create a dictionary to accumulate the full sequences
full_sequences = {}

# Loop through the original file (replace test.txt with your file name)
# and append each fragment to the matching dictionary entry
with open("test.txt") as infile:
    for line in infile:
        fields = line.split()
        # Skip blank lines and anything without both an ID and a sequence
        if len(fields) < 2:
            continue
        full_sequences[fields[0]] = full_sequences.get(fields[0], "") + fields[1]

# Write each entry as a single line; a separate output file is used
# so that the input file is not overwritten
with open("output.txt", "w") as outfile:
    for seq_id, sequence in full_sequences.items():
        outfile.write(seq_id + "\t\t" + sequence + "\n")
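
If the aligned columns matter, one way to handle the whitespace is to pad every identifier to a common width. The following is only a minimal sketch and not part of the original answer: the function names merge_alignment_blocks and format_aligned and the gap parameter are made up for illustration. It assumes Python 3.7 or later (where dictionaries keep insertion order), an input without a header line, and that any line starting with whitespace is a conservation line that can be skipped.

```python
def merge_alignment_blocks(lines):
    """Join the per-block fragments of each sequence, keeping first-seen order."""
    sequences = {}
    for line in lines:
        # Skip blank lines and lines that start with whitespace
        # (assumed here to be conservation/annotation lines)
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        seq_id, fragment = fields[0], fields[1]
        sequences[seq_id] = sequences.get(seq_id, "") + fragment
    return sequences


def format_aligned(sequences, gap=8):
    """Pad each identifier so that all sequences start in the same column."""
    # Column width = longest identifier plus a fixed gap (hypothetical default)
    width = max(len(seq_id) for seq_id in sequences) + gap
    return [seq_id.ljust(width) + seq for seq_id, seq in sequences.items()]
```

To use it, pass an open file (or any iterable of lines) to merge_alignment_blocks, then write each string returned by format_aligned to the output file followed by a newline.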
