As of pandas 0.10.1, HDFStore supports on-disk pre-selection:
import pandas as pd
import numpy.random as rd
df = pd.DataFrame(rd.randn(int(1e6)).reshape(int(1e5), 10), columns=list('abcdefghij'))
store = pd.HDFStore('newstore.h5')
# only data columns can serve as indices to select for on-disk, but there's a
# speed penalty involved, so it's a conscious decision what becomes data_column!
store.append('df', df, data_columns=['a','b'])
The following selections then happen "on disk" (which is pretty cool! ;)
In [14]: store.select('df', ['a > 0', 'b > 0'])
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24747 entries, 2 to 99998
Data columns:
a 24747 non-null values
b 24747 non-null values
c 24747 non-null values
d 24747 non-null values
e 24747 non-null values
f 24747 non-null values
g 24747 non-null values
h 24747 non-null values
i 24747 non-null values
j 24747 non-null values
dtypes: float64(10)
In [15]: store.select('df', ['a > 0'])
Out[15]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50043 entries, 0 to 99999
Data columns:
a 50043 non-null values
b 50043 non-null values
c 50043 non-null values
d 50043 non-null values
e 50043 non-null values
f 50043 non-null values
g 50043 non-null values
h 50043 non-null values
i 50043 non-null values
j 50043 non-null values
dtypes: float64(10)
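So all you have to do now is increase the number of dimensions of your data frame and see for yourself whether it is fast enough for your needs. It's very easy to play around with!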
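If you want to experiment with different shapes, something like the sketch below may be a convenient starting point. It is not part of the session above: the helper name `on_disk_select`, the column names `c0`, `c1`, ..., and the temporary file name are made up for illustration, and it assumes a reasonably recent pandas with PyTables installed. It writes a random frame, runs the same kind of on-disk filter, and returns the in-memory equivalent so you can check that both agree.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def on_disk_select(n_rows=100_000, n_cols=10, seed=0):
    """Write a random frame to a temporary HDF5 file and filter it on disk.

    Hypothetical helper for experimentation (needs n_cols >= 2).
    Returns (selected, expected), where `expected` is the same filter
    applied in memory for comparison.
    """
    rng = np.random.default_rng(seed)
    columns = [f"c{i}" for i in range(n_cols)]
    df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sketch.h5")
        with pd.HDFStore(path, mode="w") as store:
            # Only columns listed in data_columns can appear in `where`.
            store.append("df", df, data_columns=["c0", "c1"])
            # The filter is evaluated against the file, not in memory.
            selected = store.select("df", where="(c0 > 0) & (c1 > 0)")

    # Same condition as an ordinary in-memory boolean mask.
    expected = df[(df["c0"] > 0) & (df["c1"] > 0)]
    return selected, expected
```

Calling `on_disk_select(n_rows=..., n_cols=...)` with larger values, wrapped in `%timeit` or `time.perf_counter()`, should give a rough feel for how the on-disk selection scales for your data.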