I'll assume you want the gene names binned by distance in base pairs:
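    from collections import defaultdict, Counter

    bins = defaultdict(Counter)   # bin index -> Counter of gene names
    binsize = 50                  # bin width in base pairs

    # datafile is the path to your input file
    with open(datafile) as inf:
        for line in inf:
            data = line.split('<', 1)[0]   # keep only the text before the first '<'
            offset, name = data.split()
            # floor division also places negative offsets in aligned bins
            bins[int(offset)//binsize][name] += 1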
Then:
    keys = sorted(bins)
    for key in keys:
        # "count name" pairs, most frequent first
        values = ', '.join('{1} {0}'.format(a,b) for a,b in bins[key].most_common())
        print('{:>7} - {:>7} : {}'.format(binsize*key, binsize*(key+1)-1, values))
On your sample data, this produces:
     -23350 -  -23301 : 1 MIR198
     -19750 -  -19701 : 1 PRPS2
     -12150 -  -12101 : 1 SLC7A5
     -11650 -  -11601 : 1 CAMK2G
      -9050 -   -9001 : 1 KIR3DX1
       -300 -    -251 : 1 ARAP1
       -100 -     -51 : 1 CCDC88A, 1 SLC12A6
       8000 -    8049 : 1 C14orf79
      10000 -   10049 : 1 LOC100506172
      12150 -   12199 : 1 MMP14
      65950 -   65999 : 1 EFNB1
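If you want to try the binning without a file on disk, here is a minimal self-contained sketch of the same idea. The function name `bin_genes` and the three input lines are made up for illustration, and the line layout (offset, name, then an optional tail starting with `<`) is only my guess at your format:

```python
from collections import Counter, defaultdict
import io

def bin_genes(lines, binsize=50):
    """Group gene names by bin index, where each bin spans `binsize` base pairs."""
    bins = defaultdict(Counter)
    for line in lines:
        data = line.split('<', 1)[0]      # keep only the text before the first '<'
        if not data.strip():
            continue                      # skip blank lines
        offset, name = data.split()
        bins[int(offset) // binsize][name] += 1
    return bins

# Hypothetical sample input; the real file layout is an assumption.
sample = io.StringIO(
    "-75 CCDC88A <extra>\n"
    "-51 SLC12A6 <extra>\n"
    "8010 C14orf79 <extra>\n"
)

bins = bin_genes(sample)
for key in sorted(bins):
    names = ', '.join(sorted(bins[key]))
    print(50 * key, 50 * (key + 1) - 1, names)
```

The reason for using `//` rather than `int(offset / binsize)` is that floor division rounds toward negative infinity, so -75 and -51 both land in the bin running from -100 to -51. Truncating toward zero would instead lump everything from -49 to 49 into a single bin of double width.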