I have a plain text file with the following contents:

@M00964: XXXXX
YYY
+
ZZZZ 
@M00964: XXXXX
YYY
+
ZZZZ
@M00964: XXXXX
YYY
+
ZZZZ

I want to read it into a list whose items are split on the ID code @M00964, i.e.:

['@M00964: XXXXX
YYY
+
ZZZZ' 
'@M00964: XXXXX
YYY
+
ZZZZ'
'@M00964: XXXXX
YYY
+
ZZZZ']

I tried using

in_file = open(fileName,"r")
sequences = in_file.read().split('@M00964')[1:]
in_file.close()

But this removes the ID sequence @M00964. Is there any way to keep the ID sequence?

As a side question, is there any way to keep the actual whitespace in the list (rather than \n symbols)?

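My overall goal is to read in this set of items, take a subset (say, the first 2 items), and write them back to a text file with all of the original formatting preserved.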

3 Answers

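If your file is large and you don't want to hold the whole thing in memory, you can use this helper function to iterate over the individual records: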

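def chunk_records(filepath):
    with open(filepath, 'r') as f:
        record = []
        for line in f:
            # could use regex for more complicated matching
            if line.startswith('@M00964') and record:
                # a new ID line closes the record collected so far
                yield ''.join(record)
                record = []
            # always keep the line, including the ID line that opens a record
            record.append(line)
        if record:
            # emit the final record once the file is exhausted
            yield ''.join(record)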

Use it like this:

for record in chunk_records('/your/filename.txt'):
    ...

Or, if you want to hold the whole thing in memory:

records = list(chunk_records('/your/filename.txt'))
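
For the stated goal of taking the first 2 records and writing them back unchanged, the generator can be combined with itertools.islice so that only the needed records are ever read. This is only a sketch: copy_first_records, the marker parameter, and the temporary sample file are hypothetical names introduced for illustration, and chunk_records is repeated here so that the example is self-contained.

```python
import itertools
import os
import tempfile


def chunk_records(filepath, marker='@M00964'):
    """Yield one record at a time; a record starts at a line beginning with marker."""
    with open(filepath, 'r') as f:
        record = []
        for line in f:
            if line.startswith(marker) and record:
                yield ''.join(record)
                record = []
            record.append(line)
        if record:
            yield ''.join(record)


def copy_first_records(src_path, dst_path, count=2):
    """Write the first `count` records of src_path to dst_path unchanged."""
    with open(dst_path, 'w') as out:
        # islice stops pulling from the generator after `count` records.
        for record in itertools.islice(chunk_records(src_path), count):
            out.write(record)


# Demo on a throwaway file holding three 4-line records (made-up sample data).
sample = '@M00964: XXXXX\nYYY\n+\nZZZZ\n' * 3
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with open(src, 'w') as f:
    f.write(sample)

copy_first_records(src, dst)
with open(dst) as f:
    copied = f.read()
```

Because each record keeps its own line endings, writing the strings straight back reproduces the original layout.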
answered 2014-03-25T15:33:41.747

Just split on the @ symbol:

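with open(fileName, "r") as in_file:
    # Mark every '@' with a temporary '###' separator, split on it, and drop empty pieces.
    # Note: this splits on every '@' in the file, not only on the ones that start a record.
    sequences = [s for s in in_file.read().replace("@", "###@").split('###') if s]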
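
A variant that stays closer to the code in the question is to split on the full ID and then put it back at the front of each piece. The following is a minimal sketch, assuming the ID never appears inside a record; ID_MARKER and split_keep_marker are hypothetical names, and the sample text is made up.

```python
ID_MARKER = '@M00964'  # assumed record delimiter, taken from the question


def split_keep_marker(text, marker=ID_MARKER):
    """Split text on marker and restore the marker at the start of each piece."""
    pieces = text.split(marker)
    # pieces[0] is whatever precedes the first marker (empty if the text starts with it).
    return [marker + piece for piece in pieces[1:]]


text = '@M00964: XXXXX\nYYY\n+\nZZZZ\n' * 3
sequences = split_keep_marker(text)
```

Joining the resulting pieces gives back the original text, so nothing is lost in the round trip.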
answered 2014-03-25T15:23:28.977

For your specific example, couldn't you just do the following:

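with open(fileName, 'r') as in_file:
    file = in_file.readlines()

# Group every 4 lines into one record; integer division ignores a trailing partial group.
new_list = [''.join(file[i*4:(i+1)*4]) for i in range(len(file) // 4)]
# The same records with the newline characters stripped out (for display only).
list_no_n = [item.replace('\n', '') for item in new_list]

print(new_list)
print(list_no_n)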

[Expanded form]

new_list = []
for i in range(len(file) // 4):  # Iterates over the number of complete 4-line groups.
                                 # This is because we will be dealing in groups of 4 lines
    new_list.append(''.join(file[i*4:(i+1)*4]))  # Joins four lines together into a string and adds it to new_list

[Writing to a new file]

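# Each record in new_list still ends its lines with '\n', so write the records as they are.
# Use new_list[:2] instead to write only the first two records.
with open(filename, 'w') as output_file:
    output_file.writelines(new_list)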
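
On the side question about \n showing up in the list: that is only how Python displays a string inside a list. The string itself holds real line breaks, and they come out as such when it is written. A small sketch with a made-up record:

```python
import sys

record = '@M00964: XXXXX\nYYY\n+\nZZZZ\n'

# repr() is what is shown for each item when a list is printed: newlines appear as \n.
shown_in_list = repr(record)
print([record])

# Writing the string itself emits real line breaks, one per line of the record.
sys.stdout.write(record)
line_count = len(record.splitlines())
```

So there is no need to strip or replace the newlines before writing the records back.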
answered 2014-03-25T15:25:09.723