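From your example scatterplot, it looks like you have a lot of points. Plotting each one as an individual marker overplots most of the data, so only the points drawn last (the "top" ones) stay visible. That is bad practice: with this much data, some aggregation will improve the visual representation.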
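The example below shows how to bin and average the data using a 2D histogram. Once your data is in a suitable format for display, plotting the result as an image or as contours is fairly straightforward.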
Aggregating the data before plotting also improves performance and helps avoid "Array Too Big" or other memory-related errors.
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(15,5), subplot_kw={'aspect': 1})
n = 100000
x = np.random.randn(n)
y = np.random.randn(n)+5
data_values = y * x
# Normal scatter, like your example
ax[0].scatter(x, y, c=data_values, marker='x', alpha=.2)
ax[0].set_xlim(-5,5)
# Get the extent to scale the other plots in a similar fashion
xrng = list(ax[0].get_xbound())
yrng = list(ax[0].get_ybound())
# number of bins used for aggregation (np.histogram2d expects an integer)
n_bins = 130
# create the histograms
counts, xedge, yedge = np.histogram2d(x, y, bins=(n_bins,n_bins), range=[xrng,yrng])
sums, xedge, yedge = np.histogram2d(x, y, bins=(n_bins,n_bins), range=[xrng,yrng], weights=data_values)
# empty bins give 0/0 = NaN; silence the RuntimeWarning this would raise
with np.errstate(divide='ignore', invalid='ignore'):
    data_avg = sums / counts
ax[1].imshow(data_avg.T, origin='lower', interpolation='none', extent=xrng+yrng)
xbin_size = (xrng[1] - xrng[0]) / n_bins # the range divided by n_bins
ybin_size = (yrng[1] - yrng[0]) / n_bins # the range divided by n_bins
# create x,y coordinates for the histogram
# coordinates should be shifted from edge to center
xgrid, ygrid = np.meshgrid(xedge[1:] - (xbin_size / 2) , yedge[1:] - (ybin_size / 2))
ax[2].contourf(xgrid, ygrid, data_avg.T)
ax[0].set_title('Scatter')
ax[1].set_title('2D histogram with imshow')
ax[2].set_title('2D histogram with contourf')
plt.show()
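
The averaging step can also be pulled out into a small reusable function. This is only a sketch: the name binned_mean and its arguments are hypothetical and not part of the answer's code. It fills empty bins with NaN explicitly rather than relying on 0/0, and it takes the bin centers as the midpoints of consecutive edges.

```python
import numpy as np


def binned_mean(x, y, values, bins, extent):
    """Average `values` on a regular 2D grid; empty bins become NaN.

    Hypothetical helper for illustration; `extent` is [[xmin, xmax], [ymin, ymax]].
    """
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins, range=extent)
    sums, _, _ = np.histogram2d(x, y, bins=bins, range=extent, weights=values)

    # Divide only where a bin holds at least one point; the rest stays NaN.
    mean = np.full(counts.shape, np.nan)
    np.divide(sums, counts, out=mean, where=counts > 0)

    # Bin centers are the midpoints of consecutive edges.
    xcenters = (xedges[:-1] + xedges[1:]) / 2
    ycenters = (yedges[:-1] + yedges[1:]) / 2
    return mean, xcenters, ycenters


# Tiny check: two points share the lower-left bin, one sits in the upper-right.
x = np.array([0.1, 0.2, 0.9])
y = np.array([0.1, 0.3, 0.8])
values = np.array([1.0, 3.0, 10.0])
mean, xc, yc = binned_mean(x, y, values, bins=2, extent=[[0, 1], [0, 1]])
```

As in the listing above, the result is indexed as [x, y], so it should be transposed before plotting, for example with imshow(mean.T, origin='lower', ...) or contourf(xc, yc, mean.T).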
