To save data to disk without building a columnar database, the options include:

SQLite
HDF5: only numeric/fixed-width strings
pickle serialization
CSV
compressed CSV
...

I'm just wondering which one is the most efficient in terms of speed. Thanks.
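One way to approach the speed question is simply to time each option on your own data. Below is a rough sketch that uses only the standard library and covers the pickle, CSV, gzip-compressed CSV, and SQLite options from the list above. The sample data, the file names, and the helpers `make_rows` and `benchmark` are all hypothetical, it measures writes only, and results will vary with the data and the machine.

```python
import csv
import gzip
import os
import pickle
import sqlite3
import random
import tempfile
import time


def make_rows(n=10_000, seed=0):
    """Hypothetical sample data: three random ints plus a 300-char string per row."""
    rng = random.Random(seed)
    return [
        (rng.randrange(10**6), rng.randrange(10**6), rng.randrange(10**6), "X" * 300)
        for _ in range(n)
    ]


def write_pickle(rows, path):
    with open(path, "wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)


def write_csv(rows, path):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def write_csv_gz(rows, path):
    # Text-mode gzip stream so csv.writer can write to it directly.
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)


def write_sqlite(rows, path):
    con = sqlite3.connect(path)
    try:
        con.execute("CREATE TABLE test (a INTEGER, b INTEGER, c INTEGER, txt TEXT)")
        con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
        con.commit()
    finally:
        con.close()


# Illustrative file names only; each writer gets its own file in a temp directory.
WRITERS = {
    "pickle": ("test.pkl", write_pickle),
    "csv": ("test.csv", write_csv),
    "csv_gz": ("test.csv.gz", write_csv_gz),
    "sqlite": ("test.db", write_sqlite),
}


def benchmark(rows):
    """Return {format name: (write seconds, file size in bytes)}."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (filename, writer) in WRITERS.items():
            path = os.path.join(tmp, filename)
            start = time.perf_counter()
            writer(rows, path)
            elapsed = time.perf_counter() - start
            results[name] = (elapsed, os.path.getsize(path))
    return results


if __name__ == "__main__":
    for name, (seconds, size) in benchmark(make_rows()).items():
        print(f"{name:8s} {seconds:8.4f} s {size:>10d} bytes")
```

Read times can be measured the same way by adding a matching reader per format. A single run is noisy, so repeating each measurement a few times gives a fairer comparison.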

1 Answer

I would consider Feather or HDF5. MySQL or PostgreSQL might also be an option, depending on how you are going to query the data...

Here is a demo for HDF5:

In [33]: df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list('abc'))

In [34]: df['txt'] = 'X' * 300

In [35]: df
Out[35]:
           a       b       c                                                txt
0     689347  129498  770470  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
1     954132   97912  783288  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
2      40548  938326  861212  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
3     869895   39293  242473  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
4     938918  487643  362942  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
...

In [37]: df.to_hdf('c:/temp/test_str.h5', 'test', format='t', data_columns=['a', 'c'])

In [38]: store = pd.HDFStore('c:/temp/test_str.h5')

In [39]: store.get_storer('test').table
Out[39]:
/test/table (Table(10000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": StringCol(itemsize=300, shape=(1,), dflt=b'', pos=2),  # <---- NOTE
  "a": Int32Col(shape=(), dflt=0, pos=3),
  "c": Int32Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (204,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "c": Index(6, medium, shuffle, zlib(1)).is_csi=False}
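In the table description above, the line marked NOTE shows the `txt` column stored as a fixed-width `StringCol(itemsize=300)`, while `a` and `c` are indexed because they were passed as `data_columns`. As a minimal sketch of why that matters for querying, the snippet below filters on those indexed columns while reading. It assumes pandas with the PyTables package installed; the temporary path and the threshold values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support also requires the PyTables package ("tables")

# Same shape of data as in the demo: three integer columns plus a long string.
df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list("abc"))
df["txt"] = "X" * 300

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_str.h5")  # hypothetical scratch location
    df.to_hdf(path, key="test", format="table", data_columns=["a", "c"])

    # Only columns listed in data_columns (and the index) can be used in `where`.
    subset = pd.read_hdf(path, key="test", where="a > 500000 & c < 250000")

# The same filter applied in memory, for comparison.
expected = df[(df["a"] > 500000) & (df["c"] < 250000)]
print(len(subset), len(expected))
```

Filtering on `b` in `where` would not work here, because `b` was not declared as a data column when the table was written.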
Answered 2016-11-11T11:28:16.583