pandas - Proper way to get mean and describe() values from a large dataset

Question

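With read_hdf I get "array is too big", which probably means I have to iterate over the file and compute the results in chunks before combining them. Is there an automated way to do this? Or is there a better approach that I don't know about?

Any suggestions would be very helpful!

Right now I load the file with the following: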

res = pd.read_hdf(self.file, self.key, columns=get_columns)

and then compute the statistics:

# Summary statistics taken from describe()
describe = res.describe()
text = ''
count = int(describe['count'])
text += 'Count: %s\n' % (str(count))
text += 'Mean: %s\n' % (str(describe['mean']))
text += 'Standard Deviation: %s\n' % (str(describe['std']))
text += 'Range: [%s, %s]\n' % (str(int(describe['min'])), str(int(describe['max'])))
text += "25%%: %s\n" % (str(int(describe['25%'])))
text += "50%% (median): %s\n" % (str(int(describe['50%'])))
text += "75%%: %s\n" % (str(int(describe['75%'])))
# Higher moments computed directly on the loaded data
text += "Unbiased Kurtosis: %s\n" % (str(res.kurt()))
text += "Unbiased Skew: %s\n" % (str(res.skew()))
text += "Unbiased Variance: %s\n" % (str(res.var()))

Running it on an HDF file (812MB, compressed with blosc) produces:

res= pd.read_hdf(self.file, self.key, columns = get_columns)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 330, in read_hdf
    return f(store, True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 322, in <lambda>
    key, auto_close=auto_close, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 669, in select
    auto_close=auto_close).get_values()
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 1335, in get_values
    results = self.func(self.start, self.stop)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 658, in func
    columns=columns, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 3822, in read
    if not self.read_axes(where=where, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 3056, in read_axes
    values = self.selection.select()
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 4339, in select
    return self.table.table.read(start=self.start, stop=self.stop)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 1975, in read
    arr = self._read(start, stop, step, field, out)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 1865, in _read
    result = self._get_container(nrows)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 958, in _get_container
    return numpy.empty(shape=shape, dtype=self._v_dtype)
ValueError: array is too big.
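
The last frame of the traceback is numpy.empty being asked for one array that covers every selected row, using the table's full row dtype. On a 32-bit Python (python-bits: 32 in the version listing), a single array most likely cannot exceed roughly 2 GiB, and a blosc-compressed 812MB file can expand well beyond that. A rough back-of-the-envelope check might look like the sketch below. The row layout and row count are hypothetical, and the 2**31 - 1 limit is an assumption about the 32-bit build, not something the traceback states.

```python
import numpy as np


def bytes_needed(nrows, row_dtype):
    """Bytes that one contiguous read of `nrows` full rows would allocate."""
    return int(nrows) * np.dtype(row_dtype).itemsize


def fits_in_one_array(nrows, row_dtype, max_bytes=2**31 - 1):
    """True if the read fits under `max_bytes`.

    The default is the largest signed pointer-sized integer on a
    32-bit build (assumed limit for a single NumPy array).
    """
    return bytes_needed(nrows, row_dtype) <= max_bytes


# Hypothetical table layout: an int64 index plus 40 float64 values per row.
row_dtype = np.dtype([("index", "i8"), ("values", "f8", (40,))])

# Hypothetical row count of ten million.
print(bytes_needed(10000000, row_dtype))
print(fits_in_one_array(10000000, row_dtype))
```

If such a check fails, the read has to be split into smaller row ranges no matter how few columns are requested, because the traceback shows whole rows being read before any column selection is applied.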

pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Linux
OS-release: 3.13.0-24-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: 0.8.4
pymysql: None
psycopg2: None

ptdump: here
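
The chunk-and-combine approach raised in the question could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive answer: it assumes the data is stored in table format (as the traceback suggests), that a recent pandas is in use, and that HDFStore.select is called with chunksize so that it yields one DataFrame per chunk. The names streaming_stats and hdf_chunks and the chunk size are made up for illustration. Count, mean, variance, min and max combine exactly across chunks; quartiles, skew and kurtosis are not covered here.

```python
import numpy as np
import pandas as pd


def streaming_stats(chunks):
    """Combine count, mean, variance, min and max over an iterable of chunks.

    Uses the pairwise update for mean and sum of squared deviations,
    so no chunk has to be kept in memory after it is processed.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo, hi = np.inf, -np.inf
    for chunk in chunks:
        values = np.asarray(chunk, dtype="float64").ravel()
        values = values[~np.isnan(values)]  # describe() also skips NaN
        k = values.size
        if k == 0:
            continue
        c_mean = values.mean()
        c_m2 = ((values - c_mean) ** 2).sum()
        delta = c_mean - mean
        total = n + k
        m2 += c_m2 + delta ** 2 * n * k / total
        mean += delta * k / total
        n = total
        lo = min(lo, values.min())
        hi = max(hi, values.max())
    var = m2 / (n - 1) if n > 1 else float("nan")  # unbiased, like Series.var()
    return {"count": n, "mean": mean, "var": var, "std": var ** 0.5,
            "min": lo, "max": hi}


def hdf_chunks(path, key, column, chunksize=500000):
    """Yield one column of a table-format HDF5 node, chunk by chunk.

    Hypothetical helper: `path`, `key` and `column` stand in for the
    question's self.file, self.key and get_columns.
    """
    with pd.HDFStore(path, mode="r") as store:
        for chunk in store.select(key, columns=[column], chunksize=chunksize):
            yield chunk[column]
```

Usage would look like streaming_stats(hdf_chunks(path, key, column)). Exact quartiles cannot be merged from per-chunk quartiles, so the 25%, 50% and 75% values would need a separate pass or an approximate method.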
