I want to solve the following problem: when a line meets a condition, print every line from that line up to that line + a value.

I have code that looks like this:

import re
##
def round_down(num):
    return num - (num%100000)  ###reduce search space
##
##
##def Filter(infile, outfile):
##out = open(outfile,'w')
infile = open('AT_rich','r')
cov = open('30x_good_ok_bad_0COV','r') ###File with non platinum regions
#platinum_region = [row for row in Pt]
platinum_region={}  ### create dictionary for non platinum regions. Works fast
platinum_region['chrM']={}
platinum_region['chrM'][0]=[]
ct=0
for region in infile:
    (chr,start,end,types,length)= region.strip().split()
    start=int(start)
    end=int(end)
    length = int(length)
    rounded_start=round_down(start)
##
    if not (chr in platinum_region):
        platinum_region[chr]={}
    if not (rounded_start in platinum_region[chr]):
        platinum_region[chr][rounded_start]=[]
    platinum_region[chr][rounded_start].append({'start':start,'end':end,'length':length})
##
##c=0
for vcf_line in cov: ###process file with indels
##    if (c % 1000 ==0):print "c ",c
##    c=c+1
    vcf_data = vcf_line.strip().split()
    vcf_chrom=vcf_data[0]
    vcf_pos=int(vcf_data[1])
    vcf_end=int(vcf_data[2])
    coverage = int(vcf_data[3])
    rounded_vcf_position=round_down(vcf_pos) ###round positions to reduce search space
##    print vcf_chrom
    ##    for vcf_line in infile: ###process file with indels
##    if (c % 1000 ==0):print "c ",c
    overlapping = 'false'
    if vcf_chrom in platinum_region and rounded_vcf_position in platinum_region[vcf_chrom]:
        for region in platinum_region[vcf_chrom][rounded_vcf_position]:
            if (vcf_pos == region['start']):# and vcf_end == region['end']):# and (vcf_end > region['start'] and vcf_end < region['end']):
                if vcf_chrom != 'chrX' and vcf_chrom != 'chrY':
                    print vcf_data

The files are just sets of start-end intervals, and the first column [0] holds the chromosome, e.g. 'chr1':

cov:

chr1    1   3   AT_rich 3
chr1    5   8   AT_rich 4
chr1    10  12  AT_rich 3

The last column is region['length'].

infile:

chr1    1   2   4247
chr1    2   3   4244
chr1    3   5   4224
chr1    5   7   4251
chr1    7   8   4251
chr1    8   12   4254
chr1    12   15   4253

The output would be:

chr1    1   2   4247
chr1    2   3   4244
chr1    5   7   4251
chr1    7   8   4251
chr1    8   12   4254  ## no exact start-start match here, but the two files overlap
chr1    12   15   4253

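So the main idea is: if a region from one file (cov) starts at the start position of a region from the second file (infile), print all positions from that matching start position up to the length of the region from the first file (cov). Sometimes there is no exactly matching position, only some overlap, so we may not care about those cases (although it would be nice to have them in the output too).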

I want to print the lines from vcf_data (when the condition is met) through vcf_data + region['length']. How can I add that to my code?

2 Answers

I don't fully understand your input and output formats, but based on your description I think you could do something like this:

lines = string.split('\n')  # Split the content into a list of lines
for idx, line in enumerate(lines):  # Iterate over the lines along with their index
    if condition(line):  # If the line fulfills the condition
        print lines[idx:idx+length]  # Print the range of lines starting at this one
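
A more concrete sketch of the same slicing idea, written for Python 3 and run against the sample intervals from the question. The helper name `collect_blocks` and the lookup table `region_lengths` are made up for illustration, and the sketch assumes that "length" means a number of lines, which is only one possible reading of the question.

```python
def collect_blocks(lines, starts_to_length):
    """For every line whose (chromosome, start) is a key in starts_to_length,
    collect that line and the ones after it, `length` lines in total."""
    blocks = []
    for idx, line in enumerate(lines):
        chrom, start = line.split()[:2]
        length = starts_to_length.get((chrom, int(start)))
        if length is not None:  # the "condition" from the snippet above
            blocks.append(lines[idx:idx + length])
    return blocks


# Coverage intervals copied from the question (whitespace simplified).
text = """chr1 1 2 4247
chr1 2 3 4244
chr1 3 5 4224
chr1 5 7 4251
chr1 7 8 4251
chr1 8 12 4254
chr1 12 15 4253"""
lines = text.split("\n")

# Hypothetical lookup built from the 5-column file: (chrom, start) -> length.
region_lengths = {("chr1", 1): 3, ("chr1", 5): 4, ("chr1", 10): 3}

for block in collect_blocks(lines, region_lengths):
    print("\n".join(block))
```

Slicing past the end of the list is safe in Python, so a region near the end of the file simply yields a shorter block. A region whose start matches no line (start 10 here) produces nothing, so plain overlaps are not handled by this sketch.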
answered 2013-09-03T10:46:19.047

Add this condition inside the loop:

if region_count > 0:
    region_count -= 1
    print line

Before the loop:

region_count = 0

Inside the "condition met" branch, but before the new condition block above:

region_count = region['length']
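
Put together, the three pieces might look like the following Python 3 sketch. The function name `filter_lines` and the `starts_to_length` lookup are assumptions made for illustration; in the question's code the match test would be the `vcf_pos == region['start']` check and the line would be `vcf_line`. The lengths in the usage example are invented to keep the output short.

```python
def filter_lines(lines, starts_to_length):
    """Keep each matching line plus the lines after it, `length` lines in
    total, using a countdown instead of slicing."""
    kept = []
    region_count = 0  # before the loop
    for line in lines:
        chrom, start = line.split()[:2]
        length = starts_to_length.get((chrom, int(start)))
        if length is not None:  # "condition met"
            region_count = length  # start (or restart) the countdown
        if region_count > 0:  # the new condition block
            region_count -= 1
            kept.append(line)
    return kept


# Coverage intervals copied from the question (whitespace simplified).
lines = [
    "chr1 1 2 4247",
    "chr1 2 3 4244",
    "chr1 3 5 4224",
    "chr1 5 7 4251",
    "chr1 7 8 4251",
    "chr1 8 12 4254",
    "chr1 12 15 4253",
]

# Made-up lengths: two lines from start 1 and two lines from start 7.
for line in filter_lines(lines, {("chr1", 1): 2, ("chr1", 7): 2}):
    print(line)
```

Because the countdown is a single integer, this works while streaming over the file and never needs the whole file in memory. One consequence of the design is that a new match arriving while a countdown is still running overwrites the remaining count rather than adding to it.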
answered 2013-09-03T07:42:49.180