0

I have a series of strings in a file in this format:

>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada

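I'm trying to find a regex pattern that removes the newlines between the lines that follow a > line, up to the next > character. So the end result would look like: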

>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada

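Does anyone know how I could come up with a regex pattern to do this?

Side note: this format is common in computational science, where it is known as the FASTA format.

Thanks!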

4

5 Answers

1

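As mentioned in the comments, your best bet is to use an existing FASTA parser. Why not?

Here is how I would join the lines, keyed on the leading greater-than sign: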

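def joinup(f):
    """Yield header lines as-is and each record's body lines joined by spaces."""
    buf = []
    for line in f:
        if line.startswith('>'):
            if buf:
                # Flush the body of the previous record before the new header.
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    if buf:
        # Flush the body of the last record (nothing to flush for an empty file).
        yield " ".join(buf)

with open("...") as f:
    for joined_line in joinup(f):
        print(joined_line)  # blah blah...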
answered 2013-02-10T20:04:25.357
0

Assuming > only ever appears as the first character of a line, replace

"\n([^>])" with "\1"

answered 2013-02-10T18:27:42.987
0

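This should work too.

import re

sampleText = """>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
"""

# Remove every newline that is not immediately followed by '>'.
cleartext = re.sub(r"\n(?!>)", "", sampleText)

print(cleartext)

>HEADER_Text1Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text2Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text3Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada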
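Note that a substitution keyed only on "newline not followed by >" also removes the newline after each header, and it inserts no space between the joined lines. The following is a minimal sketch of one way to keep the headers on their own lines with a single regex. It assumes Python's re module and a small made-up sample; the name join_body_lines is hypothetical and not taken from any answer here.

```python
import re

def join_body_lines(text):
    """Join consecutive non-header lines with a space, leaving '>' lines alone."""
    # ^(?!>)    a line that does not start with '>'
    # (.*)\n    that line's content and its trailing newline
    # (?!>|\Z)  only when what follows is neither a header nor the end of the text
    return re.sub(r"^(?!>)(.*)\n(?!>|\Z)", r"\1 ", text, flags=re.MULTILINE)

sample = ">H1\nA\nB\nC\n>H2\nD\nE\n"
print(join_body_lines(sample))
```

For the sample above this prints >H1, then A B C, then >H2, then D E, each on its own line. The newline after a header is never touched because the pattern can only start matching on a non-header line.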

answered 2013-02-10T19:29:06.567
0

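You really don't want a regex for this, and Python and Biopython are overkill for the job. If this is actually FASTQ format, just use sed. The command below assumes each header is followed by exactly three lines, and combining the 2 and g flags is a GNU sed extension: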

sed '/^>/ { N; N; N; s/\n/ /2g }' file

Result:

>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
answered 2013-02-10T23:29:34.133
0

You don't have to use a regex:

[x if x.startswith('>') else x.replace('\n', '') for x in f.readlines()]

should work.

In [43]: f=open('test.txt')

In [44]: contents=[x if x.startswith('>') else x.replace('\n', '') for x in f.readlines()]

In [45]: contents
Out[45]: 
['>HEADER_Text1\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada',
 '>HEADER_Text2\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada',
 '>HEADER_Text3\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada']
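The list above still holds each body line as a separate element, so it needs one more step to reach the layout asked for in the question. A minimal sketch of that step, assuming the standard library's itertools.groupby and a small made-up sample; the name join_records is hypothetical:

```python
from itertools import groupby

def join_records(lines):
    """Yield header lines unchanged and each run of body lines joined by spaces."""
    stripped = (line.rstrip("\n") for line in lines)
    # groupby splits the stream into alternating runs of header and body lines.
    for is_header, group in groupby(stripped, key=lambda s: s.startswith(">")):
        if is_header:
            yield from group       # each header stays on its own line
        else:
            yield " ".join(group)  # one run of body lines becomes a single line

sample = [">H1\n", "A\n", "B\n", ">H2\n", "C\n", "D\n"]
print("\n".join(join_records(sample)))
```

Because a file object iterates line by line, an open file could be passed in place of the sample list.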
answered 2013-02-10T18:54:49.983