python - Mean values depending on binning with respect to second variable

Question

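I am working with python / numpy. As input data I have a large number of value pairs (x, y). I basically want to plot <y>(x), i.e., the mean value of y for a given data bin in x. At the moment I use a plain for loop to achieve this, which is terribly slow.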

import numpy

# create example data
x = numpy.random.rand(1000)
y = numpy.random.rand(1000)
# set resolution
xbins = 100
# find x bins
H, xedges, yedges = numpy.histogram2d(x, y, bins=(xbins,xbins) )
# calculate mean and std of y for each x bin
mean = numpy.zeros(xbins)
std = numpy.zeros(xbins)
for i in numpy.arange(xbins):
    mean[i] = numpy.mean(y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])
    std[i]  = numpy.std (y[ numpy.logical_and( x>=xedges[i], x<xedges[i+1] ) ])

Is it possible to write this in a vectorized way?

score 15 · Accepted Answer

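You are overcomplicating this unnecessarily. All you need to know is, for every bin in x, what n, sy and sy2 are: the number of y values in that x bin, the sum of those y values, and the sum of their squares. You can get those as: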

>>> import numpy as np
>>> n, _ = np.histogram(x, bins=xbins)
>>> sy, _ = np.histogram(x, bins=xbins, weights=y)
>>> sy2, _ = np.histogram(x, bins=xbins, weights=y*y)

From those:

>>> mean = sy / n
>>> std = np.sqrt(sy2/n - mean*mean)
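
The snippets above can be gathered into one self-contained sketch. It relies on the identity std = sqrt(<y^2> - <y>^2), which gives the population standard deviation (the same quantity numpy.std returns by default). The helper name binned_mean_std, the NaN convention for empty bins, and the clipping of slightly negative variances are assumptions added for illustration, not part of the answer.

```python
import numpy as np

def binned_mean_std(x, y, bins):
    """Per-bin mean and population std of y, binned by x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Compute the edges once so all three histograms share identical bins.
    n, edges = np.histogram(x, bins=bins)
    sy, _ = np.histogram(x, bins=edges, weights=y)
    sy2, _ = np.histogram(x, bins=edges, weights=y * y)

    # Assumption: empty bins (n == 0) are reported as NaN rather than
    # triggering a division by zero.
    mean = np.full(n.shape, np.nan)
    mean_sq = np.full(n.shape, np.nan)
    filled = n > 0
    np.divide(sy, n, out=mean, where=filled)
    np.divide(sy2, n, out=mean_sq, where=filled)

    # Round-off can push the variance slightly below zero; clip before sqrt.
    var = mean_sq - mean * mean
    std = np.sqrt(np.clip(var, 0.0, None))
    return mean, std, edges

# Example data in the spirit of the question.
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
mean, std, edges = binned_mean_std(x, y, bins=100)
```

Note that np.histogram treats the last bin as closed on the right, so the largest x value is counted, whereas the loop in the question uses a strict x < xedges[i+1] test and drops it.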

score 1 · Accepted Answer

If you can use pandas:

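import numpy as np
import pandas as pd
xedges = np.linspace(x.min(), x.max(), xbins+1)
# widen the outer edges slightly so x.min() and x.max() fall inside a bin
xedges[0] -= 0.00001
xedges[-1] += 0.000001
c = pd.cut(x, xedges)
# c.codes is the integer bin index of each x value
# (older pandas versions spelled this pd.groupby(pd.Series(y), c.labels))
g = pd.Series(y).groupby(c.codes)
mean2 = g.mean()
std2 = g.std(ddof=0)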
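
As a variation on the pandas approach, here is a minimal sketch that avoids nudging the edges by hand. It assumes a pandas version in which pd.cut accepts include_lowest and groupby accepts observed (both are assumptions about the installed version, not something the answer states). The column names "x" and "y" and the names df and stats are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Example data in the spirit of the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(1000), "y": rng.random(1000)})
xbins = 100

edges = np.linspace(df["x"].min(), df["x"].max(), xbins + 1)

# pd.cut uses right-closed intervals (a, b] by default; include_lowest=True
# also closes the first interval on the left so the minimum x is kept.
bins = pd.cut(df["x"], edges, include_lowest=True)

# observed=False keeps empty bins in the result; their statistics are NaN.
grouped = df.groupby(bins, observed=False)["y"]
stats = pd.DataFrame({
    "mean": grouped.mean(),
    "std": grouped.std(ddof=0),  # ddof=0 matches numpy.std's default
})
```

Because these intervals are closed on the right while np.histogram bins are closed on the left, a value lying exactly on an interior edge would land in a different bin than in the histogram-based answer.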
