解决该问题的一种方法是将散点图/点图/蜂群图中的每个“行”视为直方图中的一个 bin:
data = np.random.randn(100)
width = 0.8 # the maximum width of each 'row' in the scatter plot
xpos = 0 # the centre position of the scatter plot in x
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
yvals = centres.repeat(counts)
max_offset = width / counts.max()
offsets = np.hstack((np.arange(cc) - 0.5 * (cc - 1)) for cc in counts)
xvals = xpos + (offsets * max_offset)
fig, ax = plt.subplots(1, 1)
ax.scatter(xvals, yvals, s=30, c='b')
这显然涉及对数据进行分箱,因此您可能会失去一些精度。如果您有离散数据,则可以替换:
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
和:
centres, counts = np.unique(data, return_counts=True)
保留精确 y 坐标(即使对于连续数据)的另一种方法是使用核密度估计来缩放 x 轴上随机抖动的幅度:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
density = kde(data) # estimate the local density at each datapoint
# generate some random jitter between 0 and 1
jitter = np.random.rand(*data.shape) - 0.5
# scale the jitter by the KDE estimate and add it to the centre x-coordinate
xvals = 1 + (density * jitter * width * 2)
ax.scatter(xvals, data, s=30, c='g')
for sp in ['top', 'bottom', 'right']:
ax.spines[sp].set_visible(False)
ax.tick_params(top=False, bottom=False, right=False)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Histogram', 'KDE'], fontsize='x-large')
fig.tight_layout()
第二种方法大致基于小提琴图的工作原理。它仍然不能保证没有任何点重叠,但我发现在实践中,只要有相当数量的点(> 20),它往往会给出相当漂亮的结果,并且分布可以相当好地近似由高斯之和。