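I have two files, annotation.txt and motif_list.txt. I'm stuck at the point where, after matching a pattern, I need to print each of the following lines until the next pattern appears. The number of lines after each pattern varies. A pattern line always ends with "/Homer". Any help would be appreciated. Thanks.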

annotation.txt

AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer
gene1
gene2
gene3
ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer
gene1
gene5
gene4
gene10
--------------------------------

motif_list.txt

AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer    ERF
ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF

Code:

    import re

    # Read both files into lists of lines.
    with open("annotation.txt") as file1, open("motif_list.txt") as file2:
        annot = file1.readlines()
        motif = file2.readlines()

    for i in annot:
        if re.search("/Homer", i):
            for j in motif:
                motif_info = j.split("\t")
                if motif_info[0] == i.strip():
                    # Stuck here: print each following line until the next motif,
                    # followed by "\t", i, "\t", motif_info[1]
                    pass

Expected output:

gene1    AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene2    AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene3    AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene1    ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene5    ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene4    ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene10    ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
1 Answer

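You can load motif_list.txt into a dictionary keyed by the common part of each line (i.e. everything before /Homer), with the rest of the line as the value. Then walk through annotation.txt: each header line (one containing /Homer) sets the current suffix, and every other line is printed with that suffix appended: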

Note: I used strings for testing, but you can read the actual data from the files instead (as shown in the comments below each large string).

Setup:

motifs = """AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer    ERF
ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF""".split("\n")

# with open('motif_list.txt') as f:
#    motifs = f.read().splitlines()

annotations = """AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer
gene1
gene2
gene3
ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer
gene1
gene5
gene4
gene10""".split("\n")

# with open('annotation.txt') as f:
#    annotations = f.read().splitlines()

Processing:

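HOMER = "/Homer"
# Key: the part of each motif line before "/Homer"; value: the rest (separator + class).
motifDict = dict(m.split(HOMER,1) for m in motifs)
pattern = ""
for anno in annotations:
    if HOMER in anno:
        # Header line: remember the motif name plus its class for the lines that follow.
        pattern = anno+motifDict[anno.split(HOMER,1)[0]]
    else:
        # Gene line: print it with the current motif and class appended.
        print(anno + "\t" + pattern)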

Output:

gene1   AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene2   AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene3   AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer    BSD
gene1   ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene5   ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene4   ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
gene10  ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer    ERF
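
If you would rather work from file-like objects than test strings, the same idea might be sketched as below. This is only a rough variant, not a definitive implementation: `load_motifs`, `annotate`, and the `missing` placeholder are names invented here for illustration. It assumes motif_list.txt is tab-separated, that a line made up only of dashes is a separator to skip, and that a header absent from the motif list should get a placeholder class instead of raising a KeyError.

```python
import io

HOMER = "/Homer"

def load_motifs(lines):
    """Map each full motif name (ending in /Homer) to its class label."""
    motif_class = {}
    for line in lines:
        line = line.strip()
        if HOMER not in line:
            continue  # skip blank or malformed lines
        name, _, label = line.partition(HOMER)
        motif_class[name + HOMER] = label.strip()
    return motif_class

def annotate(lines, motif_class, missing="NA"):
    """Yield 'gene<TAB>motif<TAB>class' for each gene line under a motif header."""
    current = None
    for line in lines:
        line = line.strip()
        if not line or set(line) == {"-"}:
            continue  # assumption: blank and dash-only lines carry no data
        if line.endswith(HOMER):
            current = line  # header line: remember it for the genes that follow
        elif current is not None:
            yield "\t".join((line, current, motif_class.get(current, missing)))

# In-memory stand-ins for the two files; swap in open("motif_list.txt")
# and open("annotation.txt") to use the real data.
motif_file = io.StringIO(
    "AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer\tBSD\n"
    "E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer\tERF\n"
    "ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer\tERF\n"
)
annotation_file = io.StringIO(
    "AT1G10720(BSD)/col-AT1G10720-DAP-Seq(GSE60143)/Homer\n"
    "gene1\ngene2\ngene3\n"
    "ERF3(AP2EREBP)/colamp-ERF3-DAP-Seq(GSE60143)/Homer\n"
    "gene1\ngene5\ngene4\ngene10\n"
    "--------------------------------\n"
)

rows = list(annotate(annotation_file, load_motifs(motif_file)))
print("\n".join(rows))
```

Because `annotate` is a generator that reads one line at a time, it should also cope with a large annotation file without loading it all into memory.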
Answered 2021-08-24T19:01:13.227