
Suppose I want to create a box plot of a list that contains each of the numbers 1-5 about a million times.

Such a list has about 5,000,000 elements, but represented as a dict it takes up almost no space:

s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000}

The problem is that when I try to create a box plot of that dictionary, I get this error:

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    ax.boxplot(s)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/axes.py", line 5462, in boxplot
    if not hasattr(x[0], '__len__'):
KeyError: 0

Is there a clever way to plot the dictionary s without having to put all the elements in a list?


A comment suggested I try

boxplot(n for n, count in s.iteritems() for _ in xrange(count))

but this results in

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    boxplot(n for n, count in s.iteritems() for _ in xrange(count))
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2134, in boxplot
    ret = ax.boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/axes.py", line 5462, in boxplot
    if not hasattr(x[0], '__len__'):
TypeError: 'generator' object has no attribute '__getitem__'

2 Answers


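The whole point of using a picture to describe data is to get an overall impression of it, not to be highly precise. So compressing the data by generating one representative data point for every 1000 actual data points does little harm: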

x = [val for val, num in s.items() for i in range(num//1000)]

This should be good enough for the naked eye:

import matplotlib.pyplot as plt
import numpy as np
s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000}
x = [val for val, num in s.items() for i in range(num//1000)]
dct = plt.boxplot(x)
plt.show()
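
One caveat with the integer division above: any value that occurs fewer than 1000 times disappears from the plot entirely. Below is a minimal sketch of a variant that rounds to the nearest representative count and keeps at least one point per value. The helper name `downsample` and its `factor` parameter are made up for illustration and are not part of matplotlib; note that keeping rare values slightly over-represents them.

```python
def downsample(counts, factor=1000):
    """Expand a {value: count} dict into a short list,
    using roughly one point per `factor` occurrences."""
    points = []
    for value, count in sorted(counts.items()):
        if count <= 0:
            continue  # skip values that never occur
        # Round to the nearest whole number of representatives,
        # but never drop a value that occurs at least once.
        reps = max(1, int(round(count / float(factor))))
        points.extend([value] * reps)
    return points


s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000}
x = downsample(s)
print(len(x))  # 5000
```

The resulting list `x` can be passed to `plt.boxplot(x)` exactly as in the code above.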
answered 2012-11-12T17:48:56.453

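As far as I know, matplotlib has no method for this kind of data. Basically, you have to compute the relevant statistics and implement your own way of drawing the box plot. This might get you started: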

import matplotlib.pyplot as plt
import numpy as np


s = [{1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000},
     {1: 1000000, 0: 1000000, 8: 1000000, 3: 1000000, 7:1000000}]

def boxplot(data, x=0):

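    # Sort the (value, count) pairs by value. sorted() returns a list on both
    # Python 2 and 3, and sorting whole pairs keeps each count attached to its
    # value (np.sort along axis 0 would sort the two columns independently).
    sorted_data = np.array(sorted(data.items()))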
    values = sorted_data[:,0]
    freqs = sorted_data[:,1]
    freqs = np.cumsum(freqs)
    freqs = freqs*1./np.max(freqs)

    #get 25%, 50%, 75% percentiles
    idx = np.searchsorted(freqs, [0.25, 0.5, 0.75])
    p25, p50, p75 = values[idx]
    vmin, vmax = values.min(), values.max()

    ax = plt.gca()
    l,r = -0.2+x, 0.2+x
    #plot boxes
    plt.plot([l,r], [p50, p50], 'k')
    plt.plot([l, r, r, l, l], [p25, p25, p75, p75, p25], 'k')
    plt.plot([x,x], [p75, vmax], 'k')
    plt.plot([x,x], [p25, vmin], 'k')

for i in range(len(s)):
    boxplot(s[i],i)
plt.xlim(-0.5,1.5)
plt.show()
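
More recent matplotlib releases than the one used in the question provide `Axes.bxp`, which draws a box plot from precomputed statistics, so the drawing part does not have to be written by hand. The following is a rough sketch under that assumption. The helper names `weighted_box_stats` and `draw_boxplots` are hypothetical, and the whiskers simply span the full data range with no outliers drawn, as in the code above.

```python
def weighted_box_stats(counts, label=""):
    """Quartile summary of a {value: count} dict,
    computed without expanding it into a list."""
    items = sorted(counts.items())
    total = sum(count for _, count in items)

    def quantile(q):
        # Smallest value whose cumulative count reaches fraction q of the total
        running = 0
        for value, count in items:
            running += count
            if running >= q * total:
                return value
        return items[-1][0]

    return {
        "label": label,
        "q1": quantile(0.25),
        "med": quantile(0.5),
        "q3": quantile(0.75),
        "whislo": items[0][0],   # whiskers span the full range of values
        "whishi": items[-1][0],
        "fliers": [],            # no outliers are drawn in this sketch
    }


def draw_boxplots(dicts):
    # matplotlib is imported lazily so the statistics work without it
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bxp([weighted_box_stats(d, label=str(i)) for i, d in enumerate(dicts)])
    plt.show()


s = [{1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000},
     {1: 1000000, 0: 1000000, 8: 1000000, 3: 1000000, 7: 1000000}]
stats = [weighted_box_stats(d) for d in s]
print(stats[0]["q1"], stats[0]["med"], stats[0]["q3"])  # 2 3 4
```

Calling `draw_boxplots(s)` would then draw one box per dictionary. The quantile rule used here (first value whose cumulative share reaches the target) matches the `searchsorted` approach above rather than matplotlib's own interpolated percentiles.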
answered 2012-11-12T17:30:06.580