6

I have a 100000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for statement:

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in xrange(1000000):
    b[i]=np.median(a[np.where(a[:,0]==i),1])

Obviously it's too slow with the for iteration: any suggestions? Thanks

4

5 Answers

6

This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5

There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a first.
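As a minimal sketch of that idea, assuming the two columns are already available as separate sequences (the names `keys` and `values` are made up for illustration), the DataFrame can be built from a dict and the median taken per group:

```python
import pandas as pd

# Hypothetical column data; in practice these might be read from a file.
keys = [1, 1, 2, 2, 2, 1, 1, 1, 5, 2, 5]
values = [2.0, 3.0, 5.0, 6.0, 8.0, 4.0, 1.0, 3.5, 8.0, 1.0, 9.0]

df = pd.DataFrame({'index': keys, 'value': values})

# Selecting the 'value' column before aggregating yields a Series
# keyed by the group label.
medians = df.groupby('index')['value'].median()
print(medians)
```

This uses the same data as the array above, so the per-group medians should match the output shown there.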

More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html

answered 2012-09-25T21:06:11.697
4

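This is a little annoying, but at least you can easily get rid of the `==` comparison (which is probably your speed killer) by sorting. Trying for more is probably not very useful, though it may be possible if you do the sorting yourself, etc.: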

import numpy as np

# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:,0]) # sort by class.
a = a[sorter] # sorted version of a

# Now we need to find where there are changes in the class:
w = np.where(a[:-1,0] != a[1:,0])[0] + 1 # Where the class changes.
# for simplicity, add 0 and len(a) to have full slices...
w = np.concatenate(([0], w, [len(a)])) # concatenate takes a single sequence of arrays
# float result, since the median of an integer group can be fractional
result = np.zeros(len(w)-1, dtype=float)
for i in range(len(w)-1):
    # median of the value column (column 1) within one class
    result[i] = np.median(a[w[i]:w[i+1], 1])

# If the classes are not exactly 1, 2, ..., N we could add class information:
classes = a[w[:-1],0]

If all your classes are the same size, i.e. there are exactly as many 1s as there are 2s and so on, there are better ways, though.
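One such way, sketched here under the assumption that every class has exactly `group_size` rows (the toy data and the name `group_size` are illustrative, not from the question), is to sort once, reshape the value column so that each class is one row, and take the median along that axis:

```python
import numpy as np

# Toy data: three classes, each with exactly four rows (an assumption).
a = np.array([[2, 6], [1, 2], [3, 9], [1, 4],
              [2, 3], [3, 7], [1, 3], [2, 4],
              [3, 8], [1, 5], [2, 8], [3, 1]])

group_size = 4  # assumed known and identical for every class

# A stable sort by class makes each class one contiguous run of rows.
a_sorted = a[np.argsort(a[:, 0], kind='stable')]

# One row per class, then a single vectorized median along each row.
values = a_sorted[:, 1].reshape(-1, group_size)
medians = np.median(values, axis=1)

# The class label belonging to each entry of `medians`.
classes = a_sorted[::group_size, 0]
```

The reshape fails if the number of rows is not a multiple of `group_size`, and it silently gives wrong answers if the classes differ in size, so this only applies when the equal-size assumption really holds.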

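EDIT: Check Bitwise's version for a solution that also avoids the last for loop (he also hides some of the code in `np.unique`, which you may prefer, since speed should not matter for that part anyway).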

answered 2012-09-25T20:43:41.893
3

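Here is my version, with no for loop and no extra modules. The idea is to sort the array once; after that you can easily get the positions of the medians just by counting the indices in the first column of a: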

# sort by first column and then by second
b=a[np.lexsort((a[:,1],a[:,0]))]

# find central value for each index
c=np.unique(b[:,0],return_index=True)[1]
d=np.r_[c,len(a)]
inds=(d[1:]+d[:-1]-1)/2.0
# final result (as suggested by seberg)
medians=np.mean(np.c_[b[np.floor(inds).astype(int),1],
                      b[np.ceil(inds).astype(int),1]],1)

# inds is the index of the median value for each key

You can shorten the code if you like.
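For reference, the same sort-then-count idea can be wrapped in a function and checked on a small array. This is only a sketch: the function name `grouped_median` is made up here, and it uses integer arithmetic for the two middle positions instead of `floor`/`ceil`, which should be equivalent.

```python
import numpy as np

def grouped_median(a):
    """Return (keys, medians) for a 2-column array of (key, value) rows."""
    # Sort by key, then by value within each key.
    b = a[np.lexsort((a[:, 1], a[:, 0]))]
    # First row of each key's run, plus the end of the array.
    keys, starts = np.unique(b[:, 0], return_index=True)
    bounds = np.r_[starts, len(b)]
    # Positions of the lower and upper middle elements of each run;
    # they coincide when the run has an odd number of elements.
    lo = (bounds[:-1] + bounds[1:] - 1) // 2
    hi = (bounds[:-1] + bounds[1:]) // 2
    return keys, (b[lo, 1] + b[hi, 1]) / 2.0

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])
keys, medians = grouped_median(a)
```

The only sorting work is the single `lexsort`; everything after it is plain index arithmetic on one entry per key.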

answered 2012-09-25T21:24:51.420
2

If you find yourself wanting to do this a lot, I'd recommend taking a look at the pandas library, which makes this as easy as pie:

>>> import pandas
>>> df = pandas.DataFrame([["A", 1], ["B", 2], ["A", 3], ["A", 4], ["B", 5]], columns=["One", "Two"])
>>> print(df)
  One  Two
0   A    1
1   B    2
2   A    3
3   A    4
4   B    5
>>> df.groupby('One').median()
      Two
One     
A    3.0
B    3.5
answered 2012-09-25T21:07:20.907
1

A quick one-liner approach:

result = [np.median(a[a[:,0]==ii,1]) for ii in np.unique(a[:,0])]

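I don't believe there is much you can do to make it faster without sacrificing accuracy. But here is another attempt, which might be faster if you can skip the sort step: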

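num_in_ind = np.bincount(a[:,0]) # requires non-negative integer indices
# integer division (//) so the middle position is a valid index
results = [np.sort(a[a[:,0]==ii,1])[num_in_ind[ii]//2] for ii in np.unique(a[:,0])]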

For small arrays the latter is slightly faster. Not sure whether it is fast enough.
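The accuracy trade-off can be seen on a small example. The sketch below (toy data and variable names are illustrative) compares `np.median` with picking the single element at position `count // 2` of the sorted group: the two agree for groups of odd size, while for even-sized groups the second variant returns the upper of the two middle values rather than their mean.

```python
import numpy as np

# Toy data: key 1 has three values, key 2 has four.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4], [2, 9]])

keys = np.unique(a[:, 0])

# Exact median per key.
exact = [float(np.median(a[a[:, 0] == k, 1])) for k in keys]

# Single middle element per key (upper middle for even-sized groups).
counts = np.bincount(a[:, 0])
upper_middle = [int(np.sort(a[a[:, 0] == k, 1])[counts[k] // 2]) for k in keys]

print(exact)         # key 1: 3.0, key 2: 5.0
print(upper_middle)  # key 1: 3,   key 2: 6
```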

answered 2012-09-25T21:10:39.420