1

I have a sorted list; actually, a huge array of (x, y, z) triples sorted by x. My goal is to split it into parts according to ranges of x. I have been trying:

for triple in hugelist:  
    while triple[0] >= minx and triple[0] < maxx:  
        #do some stuff  
    # when out of that range, increase endpoints to the next range  
    minx = minx + deltax  
    maxx = maxx + deltax  
    # do some other stuff  
    # and hopefully move to next triple  

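Of course this does not work, because I am misusing while, and I understand why. However, I do not know how to walk through the list. hugelist holds about 2 million triples, which can be split into about 600 chunks. If possible, I would like to pass through it in order just once.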

===============================

With Tim's help, using a 291-point mini list, bisect misses the place where maxx should go:

while xstart < len(heights):   
    xfinish = bisect.bisect_left(heights, (maxx, 0, 0), lo=xstart)    
    xslice = heights[xstart:xfinish]  
    print("xstart is ", xstart, " xfinish is ", xfinish)
    print("maxx is ", maxx, " xslice is ", xslice)

    maxx += deltax   
    xstart = xfinish  


xstart is  0  xfinish is  291  
maxx is  804.0  xslice is  [(803.01, 1941.84, 0.74) (803.04, 1941.88, 0.45) (803.06, 1941.25, 0.0)
 (803.07, 1941.01, 0.0) (803.07, 1941.52, 0.31) (803.09, 1941.16, 0.08)
 (803.12, 1940.05, 0.0) (803.13, 1939.72, 0.3) (803.13, 1939.86, 0.11)
 (803.13, 1940.29, 0.17)  . . .  (803.23, 1938.24, 0.2)
 (803.23, 1938.25, 0.45) (803.23, 1938.29, 0.1) (803.23, 1938.36, 0.0)
 (803.23, 1938.49, 0.0) (803.96, 1941.06, 4.21) (**803.98**, 1940.6, 4.55)
 (**804.0**, 1940.32, 4.49) (**804.01**, 1940.68, 4.6) . . .  (806.11, 1934.82, 10.64)
 (806.11, 1934.86, 10.65) (806.11, 1934.91, 10.56) (806.32, 1933.24, 4.69)]

4 Answers

2

Here is a different, more efficient approach that takes advantage of the list being sorted:

from bisect import bisect_left

istart = 0
while istart < len(hugelist):
    ifinish = bisect_left(hugelist, (maxx, 0, 0), lo=istart)
    # Now work on the slice hugelist[istart:ifinish].
    # It's possible that istart == ifinish, i.e. that the
    # slice is empty!
    maxx += deltax
    istart = ifinish

Using binary search reduces the number of comparisons needed.

Edit: from the comments:

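If you think of list indices as pointing between elements, it becomes quite clear: 0 is "to the left" of the leftmost element, and len(hugelist) is "to the right" of the rightmost element. bisect_left() then returns the position just before the first triple whose first element is >= maxx.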
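One detail worth checking, offered as a hedged aside: the probe (maxx, 0, 0) is compared as a whole tuple, so a triple with x == maxx and a negative y would sort before it. A one-element probe (maxx,) sidesteps that, because a tuple that is a proper prefix of another compares as smaller. The data below is made up for illustration.

```python
from bisect import bisect_left

# Made-up data: two triples share x == 2.0, one of them with a negative y.
triples = [(1.0, 5.0, 0.0), (2.0, -3.0, 0.0), (2.0, 4.0, 0.0), (3.0, 0.0, 0.0)]
maxx = 2.0

# Full-triple probe: (2.0, -3.0, 0.0) sorts before (2.0, 0, 0),
# so the cut lands after it.
cut_full = bisect_left(triples, (maxx, 0, 0))

# One-element probe: (2.0,) sorts before every triple whose x is 2.0.
cut_prefix = bisect_left(triples, (maxx,))

print(cut_full)    # 2
print(cut_prefix)  # 1
```

If y is never negative, both probes give the same cut.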

An example should really help:

from bisect import bisect_left

hugelist = [(0,0,0), (1,0,0), (3,0,0), (4,1,1), (4,2,2), (5,0,0)]
maxx = 0
deltax = 1
istart = 0
while istart < len(hugelist):
    ifinish = bisect_left(hugelist, (maxx, 0, 0), lo=istart)
    # Now work on the slice hugelist[istart:ifinish].
    # It's possible that istart == ifinish, i.e. that the
    # slice is empty!
    print("for maxx =", maxx, hugelist[istart:ifinish])
    maxx += deltax
    istart = ifinish

And the output:

for maxx = 0 []
for maxx = 1 [(0, 0, 0)]
for maxx = 2 [(1, 0, 0)]
for maxx = 3 []
for maxx = 4 [(3, 0, 0)]
for maxx = 5 [(4, 1, 1), (4, 2, 2)]
for maxx = 6 [(5, 0, 0)]

This mainly shows the end cases, which are what any sane reader would worry about ;-)
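The loop above can also be wrapped into a reusable generator. This is only a sketch: the name iter_x_slices and its parameters are hypothetical, not part of the answer, and the probe (maxx, 0, 0) is kept as in the answer.

```python
from bisect import bisect_left

def iter_x_slices(triples, first_maxx, deltax):
    """Yield (maxx, chunk) for consecutive x ranges of a list sorted by x."""
    istart = 0
    maxx = first_maxx
    while istart < len(triples):
        # Position just before the first triple that is >= (maxx, 0, 0)
        ifinish = bisect_left(triples, (maxx, 0, 0), lo=istart)
        yield maxx, triples[istart:ifinish]  # the chunk may be empty
        maxx += deltax
        istart = ifinish

hugelist = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (4, 1, 1), (4, 2, 2), (5, 0, 0)]
chunks = list(iter_x_slices(hugelist, first_maxx=1, deltax=1))
for maxx, chunk in chunks:
    print(maxx, chunk)
```

Each triple lands in exactly one chunk, and the list is traversed once, front to back.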

answered 2013-11-03T02:46:33.293
1

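You can simply use if to check whether triple[0] is in the desired range. No inner loop is needed. If the list is sorted by x value, there is no need to compare against the minimum; just check whether it is below the maximum.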

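for triple in hugelist:
    if triple[0] < maxx:
        pass  # do some stuff with a triple in the current range
    else:
        # this triple belongs to a later range: advance the upper bound
        maxx = maxx + deltax
        # do some other stuff (this triple still needs handling too)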

Depending on what you want to do, you could also look at itertools.groupby.

Edit: if, as you say in the comments, the goal is to get the variance of the z values in each range, you can do the following:

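from numpy import var  # assumption: the answer does not say where var comes from

z_variances = []
z_group = []
maxx = deltax  # assumes the first range starts at x = 0
for x, y, z in huge_list:
    if x >= maxx:
        # x left the current range: close the group, then skip empty ranges
        if z_group:
            z_variances.append(var(z_group))
        z_group = []
        while x >= maxx:
            maxx += deltax
    z_group.append(z)
if z_group:
    # the last group is never closed inside the loop
    z_variances.append(var(z_group))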

Or use groupby:

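import itertools

z_variances = []
# key on the x value of each triple (t[0]), not on the triple itself
for _, group in itertools.groupby(huge_list, lambda t: int(t[0] / deltax)):
    z_variances.append(var([z for x, y, z in group]))  # a list, not a generator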
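For a version that runs without any third-party package, here is a small sketch using statistics.pvariance from the standard library as a stand-in for var. The data and the deltax value are made up; pvariance is the population variance, which is also what numpy.var computes by default.

```python
import itertools
import statistics

# Made-up triples sorted by x; deltax is the width of each x range.
huge_list = [(0.2, 0, 1.0), (0.7, 0, 3.0), (1.1, 0, 2.0), (1.9, 0, 2.0), (3.5, 0, 7.0)]
deltax = 1.0

z_variances = []
for bucket, group in itertools.groupby(huge_list, key=lambda t: int(t[0] // deltax)):
    zs = [z for _, _, z in group]
    # Keep the bucket index so that skipped (empty) ranges stay visible
    z_variances.append((bucket, statistics.pvariance(zs)))

print(z_variances)  # [(0, 1.0), (1, 0.0), (3, 0.0)]
```

Note that groupby never produces a group for an empty range (bucket 2 here), so keeping the bucket index next to each variance avoids misaligned results.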
answered 2013-11-03T02:24:07.987
1

First, create a sample numpy array:

>>> import numpy as np
>>> alen=300000
>>> huge=np.arange(alen).reshape(alen//3,3)
>>> huge
array([[     0,      1,      2],
       [     3,      4,      5],
       [     6,      7,      8],
       ..., 
       [299991, 299992, 299993],
       [299994, 299995, 299996],
       [299997, 299998, 299999]])

This syntax gives you the first column:

>>> huge[:,0]
array([     0,      3,      6, ..., 299991, 299994, 299997])

Since you state that the subarrays are sorted, you can use numpy.searchsorted to split the larger array into buckets.

Let's split it into three parts:

>>> minx=huge[-1][0]//3
>>> maxx=huge[-1][0]*2//3
>>> minx
99999
>>> maxx
199998

Just use np.searchsorted on the first column to find the row indices where your range boundaries fall:

>>> np.searchsorted(huge[:,0],[minx,maxx])
array([33333, 66666])

Then slice huge into the desired buckets:

>>> buckets=np.searchsorted(huge[:,0],[minx,maxx])
>>> bucket1=huge[0:buckets[0]]
>>> bucket2=huge[buckets[0]:buckets[1]]
>>> bucket3=huge[buckets[1]:]
>>> bucket1
array([[    0,     1,     2],
       [    3,     4,     5],
       [    6,     7,     8],
       ..., 
       [99990, 99991, 99992],
       [99993, 99994, 99995],
       [99996, 99997, 99998]])
>>> bucket2
array([[ 99999, 100000, 100001],
       [100002, 100003, 100004],
       [100005, 100006, 100007],
       ..., 
       [199989, 199990, 199991],
       [199992, 199993, 199994],
       [199995, 199996, 199997]])
>>> bucket3
array([[199998, 199999, 200000],
       [200001, 200002, 200003],
       [200004, 200005, 200006],
       ..., 
       [299991, 299992, 299993],
       [299994, 299995, 299996],
       [299997, 299998, 299999]])
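The same idea extends from three buckets to many equal-width ones. The following is a sketch under assumptions: the data is randomly generated, the x range 0 to 10 and deltax = 1.0 are made up, and np.split is used to cut at all boundaries in one call.

```python
import numpy as np

# Made-up data: rows are (x, y, z), sorted by x.
rng = np.random.default_rng(0)
n = 1000
x = np.sort(rng.uniform(0.0, 10.0, n))
huge = np.column_stack([x, rng.normal(size=n), rng.normal(size=n)])

deltax = 1.0
# Interior range boundaries: 1.0, 2.0, ..., 9.0
edges = np.arange(deltax, 10.0, deltax)

# Index of the first row with x >= each boundary
cuts = np.searchsorted(huge[:, 0], edges)

# One sub-array per range; consecutive cut indices delimit the rows
buckets = np.split(huge, cuts)

# Variance of z per bucket, with NaN for an empty bucket
z_variances = [b[:, 2].var() if len(b) else float("nan") for b in buckets]
print(len(buckets), sum(len(b) for b in buckets))
```

With 9 interior boundaries this yields 10 buckets, and every row ends up in exactly one of them.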

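You can also use np.histogram. It returns per-bin counts, so the slice boundaries are the running totals of those counts:

>>> counts=np.histogram(huge[:,0],[0,minx,maxx,huge[-1][0]])[0]
>>> ends=np.cumsum(counts)
>>> b1=huge[:ends[0]]
>>> b2=huge[ends[0]:ends[1]]
>>> b3=huge[ends[1]:ends[2]]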
answered 2013-11-03T02:27:27.243
0

If you only want "up to x", use itertools.takewhile:

import itertools

li = [(1,2,3),(4,5,6),(7,8,9),(10,11,12),(13,14,15)]

list(itertools.takewhile(lambda x: x[0] < 10,li))
Out[78]: [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
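One caveat if takewhile is called repeatedly on the same iterator to peel off successive ranges: it consumes the first element that fails the predicate, so that element is lost to the next call. A small demonstration, reusing li from above:

```python
import itertools

li = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]

it = iter(li)
first = list(itertools.takewhile(lambda t: t[0] < 5, it))
# (7, 8, 9) was read and rejected by the first call, so it is gone from `it`
second = list(itertools.takewhile(lambda t: t[0] < 11, it))

print(first)   # [(1, 2, 3), (4, 5, 6)]
print(second)  # [(10, 11, 12)]
```

This is why takewhile suits a single "up to x" cut better than splitting the whole list into consecutive ranges.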

If you want to assign groups across the whole collection, that is a job for itertools.groupby:

def grouper(x):
    if x < 5:
        return 0
    if x < 11:
        return 1
    return 2

for i,g in itertools.groupby(li,lambda x: grouper(x[0])):
    print('group {}: {}'.format(i,list(g)))

group 0: [(1, 2, 3), (4, 5, 6)]
group 1: [(7, 8, 9), (10, 11, 12)]
group 2: [(13, 14, 15)]
answered 2013-11-03T02:32:02.213