Sample input file (the actual input file contains about 50,000 entries):
615 146
615 180
615 53
615 42
615 52
615 52
615 51
615 45
615 49
616 34
616 44
616 42
616 41
616 42
617 42
617 43
617 42
685 33
685 33
685 33
686 33
686 33
687 47
687 68
737 449
737 41
737 1138
738 46
738 53
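I have to compare each value in column 0 with the identical values next to it (e.g. 615, 615, 615) and group those rows into one cluster. The cluster must contain the column 1 values, such as 146, 180, ..., 45, 49. The cluster must then end, and the next run of identical values (616, 616, 616) must form another cluster, and so on.
The code I wrote is: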
from __future__ import division
from sys import exit

h = 0
historyjobs = []
targetjobs = []

def quickzh(zhlistsub,
            targetjobs=targetjobs,num=0,denom=0):
    li = [] ; ji = []
    j = 0
    for i in zhlistsub:
        x1 = targetjobs[j][0]
        x = targetjobs[i][0]
        num += x
        denom += 1
        if x1 >= 0.9 * (num/denom):#to group all items with same value in column 0
            li.append(targetjobs[i][1])
        else:
            break
    return li

def filewr(listli):
    global h
    s = open("newout1","a")
    if(len(listli) != 0):
        h += 1
        s.write("cluster: %d"%h)
        s.write("\n")
        s.write(str(listli))
        s.write("\n\n")
    else:
        print "0"

def new(inputfile,
        historyjobs=historyjobs,targetjobs=targetjobs):
    zhlistsub = [];zhlist = []
    k = 0
    with open(inputfile,'r') as f:
        for line in f:
            job = map(int,line.split())
            targetjobs.append(job)
    while True:
        if len(targetjobs) != 0:
            zhlistsub = [i for i, element in enumerate(targetjobs)]
            if zhlistsub:
                listrun = quickzh(zhlistsub)
                filewr(listrun)
            historyjobs.append(targetjobs.pop(0))
            k += 1
        else:
            break

new('newfinal1')
The output I get is:
cluster: 1
[146, 180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
cluster: 2
[180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
cluster: 3
[53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53]
... and so on
But the output I need is:
cluster: 1
[146, 180, 53, 42, 52, 52, 51, 45, 49]
cluster: 2
[34, 44, 42, 41, 42]
cluster: 3
[42, 43, 42]
... and so on
Can anyone suggest what changes I should make to the condition to get the desired result? It would be really helpful.
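
One possible way to get the desired clusters, sketched under two assumptions: rows with the same column 0 value are always adjacent in the file (as in the sample), and "same value" means exact equality rather than the 0.9 * mean threshold used in the code above. The function names cluster_by_first_column, read_rows and write_clusters are hypothetical helpers introduced only for this illustration. The sketch relies on itertools.groupby, which starts a new group each time the key changes, so every row is visited once instead of rescanning the whole list after each pop.

```python
from itertools import groupby


def cluster_by_first_column(rows):
    """Group consecutive rows that share the same column-0 value.

    Returns a list of lists; each inner list holds the column-1 values
    of one run of identical column-0 values, in input order.
    """
    return [[row[1] for row in run]
            for _, run in groupby(rows, key=lambda row: row[0])]


def read_rows(path):
    """Parse a whitespace-separated integer file, skipping blank lines."""
    with open(path) as f:
        return [tuple(int(tok) for tok in line.split())
                for line in f if line.strip()]


def write_clusters(clusters, path):
    """Write clusters in the 'cluster: N' layout of the desired output."""
    with open(path, "w") as out:
        for n, values in enumerate(clusters, start=1):
            out.write("cluster: %d\n%s\n\n" % (n, values))


# Small in-memory check using a few rows taken from the sample input.
sample = [(615, 146), (615, 180), (615, 53), (616, 34), (616, 44), (617, 42)]
clusters = cluster_by_first_column(sample)
```

For the real file, something like write_clusters(cluster_by_first_column(read_rows('newfinal1')), 'newout1') should produce the layout shown under "the output I need". Note that this sketch opens the output file in "w" mode, so it overwrites newout1 instead of appending to it. If identical column 0 values can be scattered through the file, the rows would need to be sorted by column 0 first.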