I'm trying to edit the format of a file. Right now it looks like this:

>Cluster 0
L07510
>Cluster 1
AF480591
AY457083
>Cluster 2
M88154
>Cluster 3
CP000924
L09161
>Cluster 4
AY742307
>Cluster 5
L09163
L09162
>Cluster 6
AF321086
>Cluster 7
DQ66288691

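I want to write something in Python that goes through each line, stops at a line that says ">Cluster x" (where x is a number), and then prepends that number to every line that follows it. When it reaches a new ">Cluster x" line, it starts over with the new value of x.

So it would look like this: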

>Cluster 0
0 L07510
>Cluster 1
1 AF480591
1 AY457083
>Cluster 2
2 M88154
>Cluster 3
3 CP000924
3 L09161
>Cluster 4
4 AY742307
> DQ288691

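I was thinking I could use a regex to search for ">Cluster x" (would the regex look something like this? ('\>Cluster \d+')), and then have the program prepend the matched \d+ to every line after the match. I'm just not sure how to actually write this. Any help would be greatly appreciated!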

2 Answers

Tested:

#!/usr/bin/env python
# On a POSIX-compliant system, if this script is marked as executable, the
# shebang line above (which must be the very first line of the file) makes
# the Python interpreter run it rather than the shell.

# We need the sys module to read arguments from the terminal
import sys

# Prepare a variable for the cluster number to be used within the loop
cluster = ''

# Open the input file; the default mode is 'r' (read-only), a safe default
with open(sys.argv[1]) as infile:
    # Loop through all lines in the file, using a generator expression
    # that strips the newline character off each line as it is read
    for line in (line.strip() for line in infile):
        if line.startswith('>'):
            # str.split() splits on whitespace by default;
            # we want the cluster number at index 1
            cluster = line.split()[1]

            # output this line to stdout unmodified
            print(line)

        else:
            # output any other line with the cluster number prepended
            print(cluster + ' ' + line)

Usage:

$ python cluster_format.py input.txt > output.txt
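
Since the question asks about a regex: below is a minimal sketch of the same loop built on the `re` module. It assumes header lines look like `>Cluster N` (tolerating whitespace after the `>` is my own assumption, not something the question specifies), and the names `HEADER` and `tag_lines` are made up for illustration.

```python
import re

# Matches header lines such as ">Cluster 12"; the group captures the number.
# Allowing optional whitespace after ">" is an assumption for this sketch.
HEADER = re.compile(r'^>\s*Cluster\s+(\d+)\s*$')


def tag_lines(lines):
    """Yield header lines unchanged and other non-blank lines
    prefixed with the most recently seen cluster number."""
    cluster = ''
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # Remember the number for the lines that follow
            cluster = match.group(1)
            yield line
        elif line:
            # Lines seen before any header are passed through as-is
            yield cluster + ' ' + line if cluster else line
```

Because a file object is iterable line by line, something like `for out in tag_lines(open('input.txt')): print(out)` should work. The regex is stricter than `startswith('>')`: a line that begins with `>` but is not a cluster header is treated as data rather than crashing on `split()[1]`.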
answered 2013-07-03T17:01:41.333

Oh, I love parsing.

Here's the deal:

current_cluster = ""
new_lines = []

# assuming all of this text is in a string variable called lines
for line in lines.splitlines():
    if line.startswith(">Cluster"):
        # ">Cluster " is 9 characters, so the number starts at index 9
        current_cluster = line[9:].strip()
    elif line:
        # otherwise, take the line itself and prepend the current cluster
        line = "{} {}".format(current_cluster, line)

    new_lines.append(line)

# join the pieces back into one newline-separated string
result = "\n".join(new_lines)
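
The snippet above assumes the text is already in memory. As a rough sketch of the same idea working straight from disk, something like the following might do; `prefix_file` is a name invented for this example, and the paths passed to it are whatever files you happen to have.

```python
def prefix_file(in_path, out_path):
    """Copy in_path to out_path, prefixing each non-header,
    non-blank line with the current cluster number."""
    current_cluster = ""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">Cluster"):
                # Everything after ">Cluster" is the cluster number
                current_cluster = line[len(">Cluster"):].strip()
            elif line:
                line = "{} {}".format(current_cluster, line)
            dst.write(line + "\n")
```

Calling `prefix_file("input.txt", "output.txt")` would then write the prefixed version next to the original. Reading line by line keeps memory use flat for large files, at the cost of not having the result available as a string.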
answered 2013-07-03T16:51:46.583