To persist the data on disk without setting up a columnar database, the options include:
SQLite,
HDF5: numeric and fixed-width string columns only,
pickle serialization,
CSV,
compressed CSV,
....
I'm just wondering which of these is the most efficient in terms of speed? Thanks
I would consider Feather and HDF5. MySQL or PostgreSQL might also be an option, depending on how you are going to query the data...
Here is a demo for HDF5:
In [33]: df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list('abc'))
In [34]: df['txt'] = 'X' * 300
In [35]: df
Out[35]:
a b c txt
0 689347 129498 770470 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
1 954132 97912 783288 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
2 40548 938326 861212 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
3 869895 39293 242473 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
4 938918 487643 362942 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
...
In [37]: df.to_hdf('c:/temp/test_str.h5', 'test', format='t', data_columns=['a', 'c'])
In [38]: store = pd.HDFStore('c:/temp/test_str.h5')
In [39]: store.get_storer('test').table
Out[39]:
/test/table (Table(10000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
"values_block_1": StringCol(itemsize=300, shape=(1,), dflt=b'', pos=2), # <---- NOTE
"a": Int32Col(shape=(), dflt=0, pos=3),
"c": Int32Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (204,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"c": Index(6, medium, shuffle, zlib(1)).is_csi=False}