python - Fast categorization (binning)

Question

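I have a huge number of entries, each of which is a float. These data x are accessible through an iterator. I need to classify all the entries using selections such as 10<y<=20, 20<y<=50, ..., where y is data from another iterable. The number of entries is much larger than the number of selections. In the end I want a dictionary like: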

{ 0: [all events with 10<x<=20],
  1: [all events with 20<x<=50], ... }

or something similar. For example, I'm doing:

for x, y in itertools.izip(variable_values, binning_values):
    thebin = binner_function(y)
    self.data[tuple(thebin)].append(x)

In general, y is multidimensional.

This is very slow. Is there a faster solution, for example using numpy? I think the problem comes from the list.append method I'm using, not from binner_function.

score 3 · Accepted Answer

A fast way to get the assignments in numpy is using np.digitize:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html

You'd still have to split the resulting assignments up into groups. If x or y is multidimensional, you will have to flatten the arrays first. You could then get the unique bin assignments and iterate over those, using np.where to split the assignments up into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.

As a somewhat trivial example that you will need to tweak/elaborate on for your particular problem (but is hopefully enough to get you started with a numpy solution):

In [1]: import numpy as np

In [2]: x = np.random.normal(size=(50,))

In [3]: b = np.linspace(-20,20,50)

In [4]: assign = np.digitize(x,b)

In [5]: assign
Out[5]: 
array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
       25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
       25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])

In [6]: uid = np.unique(assign)

In [7]: adict = {}

In [8]: for ii in uid:
   ...:     adict[ii] = np.where(assign == ii)[0]
   ...:     

In [9]: adict
Out[9]: 
{23: array([ 0,  8, 11, 25, 36, 37]),
 24: array([ 4,  6,  9, 24, 29, 30, 33, 35, 40, 49]),
 25: array([ 1,  2,  3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
 26: array([ 5,  7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
 27: array([14, 32, 43, 46])}
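If the per-bin np.where loop itself becomes the bottleneck, one possible variation is to sort the entries by bin index once and cut the sorted array at the bin boundaries. This is only a sketch: the bin edges and the names order, first and binned are made up for illustration, and it stores the values rather than their indices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)             # values to be binned
edges = np.array([-1.0, 0.0, 1.0])    # hypothetical bin edges

assign = np.digitize(x, edges)        # bin index for every value

# Sort once by bin index so that members of the same bin are contiguous.
order = np.argsort(assign, kind="stable")
sorted_assign = assign[order]

# uid: the bins that actually occur; first: where each one starts.
uid, first = np.unique(sorted_assign, return_index=True)

# Cut the sorted values at the start of every bin except the first.
groups = np.split(x[order], first[1:])
binned = dict(zip(uid.tolist(), groups))
```

Here binned[k] holds the values that fell into bin k. Whether this beats the np.where loop depends on the number of bins, so it is worth timing both on your data.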

For dealing with flattening and then unflattening numpy arrays, see: http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html
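For a multidimensional y, a rough sketch of how these two functions might fit in: digitize each column separately, combine the per-column bin indices into a single flat bin id with np.ravel_multi_index, and turn it back into a tuple key with np.unravel_index. The data, the edges and the names edges0, edges1 and binned2d are assumptions for illustration, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)                    # values to be binned
y = rng.uniform(0.0, 1.0, size=(500, 2))    # two binning variables per entry

edges0 = np.array([0.25, 0.5, 0.75])        # hypothetical edges, 1st variable
edges1 = np.array([0.5])                    # hypothetical edges, 2nd variable

i0 = np.digitize(y[:, 0], edges0)           # indices 0..3
i1 = np.digitize(y[:, 1], edges1)           # indices 0..1

# One flat bin id per entry; shape gives the number of bins per variable.
shape = (len(edges0) + 1, len(edges1) + 1)
flat = np.ravel_multi_index((i0, i1), shape)

binned2d = {}
for f in np.unique(flat):
    # Recover the (i0, i1) pair to use as the dictionary key.
    key = tuple(int(k) for k in np.unravel_index(f, shape))
    binned2d[key] = x[flat == f]
```

The keys are tuples such as (2, 1), which is close to the tuple(thebin) keys used in the question.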

score 0 · Accepted Answer

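np.searchsorted is your friend. As I read somewhere here in another answer on the same topic, it's currently a good deal faster than digitize and does the same job.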

http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html
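A small sketch of the correspondence, assuming increasing bin edges (the edges and values below are made up to match the 10<y<=20, 20<y<=50 selections in the question). Note the argument order: searchsorted takes the edges first. The speed difference may not hold on recent NumPy versions, where the digitize documentation describes it as implemented in terms of searchsorted, so it is worth timing both.

```python
import numpy as np

edges = np.array([10.0, 20.0, 50.0])
y = np.array([5.0, 10.0, 15.0, 20.0, 35.0, 50.0, 70.0])

# side="left": bin i means edges[i-1] < y <= edges[i],
# the same convention as np.digitize(y, edges, right=True).
left = np.searchsorted(edges, y, side="left")
# left -> array([0, 0, 1, 1, 2, 2, 3])

# side="right": bin i means edges[i-1] <= y < edges[i],
# the same convention as the default np.digitize(y, edges).
right = np.searchsorted(edges, y, side="right")
```

With side="left", index 1 collects 10<y<=20 and index 2 collects 20<y<=50; index 0 and index 3 are the underflow and overflow bins.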
