I have a large file that I want to reformat. Sample input:
DVL1 03220 NP_004412.2 VANGL2 02758 Q9ULK5 in vitro 12490194
PAX3 09421 NP_852124.1 MEOX2 02760 NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254.1 in vitro;in vivo 15195140
This is what I want it to look like:
DVL1 03220 NP_004412 VANGL2 02758 Q9ULK5
PAX3 09421 NP_852124 MEOX2 02760 NP_005915
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254
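To summarize:
- If the line has 1 dot, remove the dot and the number after it and add a \t, so the output line has only 6 tab-separated values
- If the line has 2 dots, remove both dots together with the numbers after them and add a \t, so the output line has only 6 tab-separated values
- If the line has no dots, keep the first 6 tab-separated values
My idea so far is this: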
for line in infile:
    if "." in line:  # thought about this and a line.count('.') might be better, just wasn't capable to make it work
        transformed_line = line.replace('.', '\t', 2)  # only replaces the dot; want to replace dot plus next first character
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n')  # if i had a way to know the position of the dot(s), i could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n')  # this is fine
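One possible direction, as a rough sketch only: if the fields really are tab-separated and the dot-plus-digits suffix only ever appears at the end of a field, then all three rules above reduce to "keep the first six fields and strip a trailing `.N` from each", so the position of the dots never needs to be known. The names `VERSION_SUFFIX`, `format_line`, and `format_file` are made up for illustration and are not part of the code above.

```python
import re

# Assumption: a version suffix is a dot followed by digits at the END of a field,
# e.g. ".2" in "NP_004412.2".
VERSION_SUFFIX = re.compile(r"\.\d+$")


def format_line(line, keep=6):
    """Keep the first `keep` tab-separated fields and drop any trailing ".N" from each."""
    fields = line.rstrip("\n").split("\t")[:keep]
    return "\t".join(VERSION_SUFFIX.sub("", field) for field in fields)


def format_file(in_path, out_path):
    """Apply format_line to every line of in_path and write the result to out_path."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            outfile.write(format_line(line) + "\n")
```

Because the slice to six fields happens before the substitution, dots in the later columns (method, reference number) would not affect the result. Fields without a suffix, such as `Q9ULK5`, pass through unchanged.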
I hope I've explained this clearly. Thanks for your help.