
I have a numpy structured array containing integers and floats, which I use to initialize a pandas DataFrame:

In [497]: x = np.ones(100000000, dtype=[('f0', '<i8'), ('f1', '<f8'),('f2','<i8'),('f3', '<f8'),('f4', '<f8'),('f5', '<f8'),('f6', '<f8'),('f7', '<f8')])

In [498]: %timeit pd.DataFrame(x)
The slowest run took 4.07 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 2min 26s per loop


In [499]: xx=x.view(np.float64).reshape(x.shape + (-1,))

In [500]: %timeit pd.DataFrame(xx)
1 loops, best of 3: 256 ms per loop

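As the code above shows, initializing a DataFrame from a structured array is slow. If I convert the data to a contiguous float numpy array instead, it is fast. But I still need the DataFrame to hold a mix of floats and integers.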
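To make the limitation of the float view concrete, here is a minimal sketch on a much smaller array. The 5-row size and the three-field dtype are stand-ins chosen for illustration, not the array timed above. The view shares memory with the original, but the integer fields are only reinterpreted byte for byte, not converted:

```python
import numpy as np

# Small stand-in for the array above (assumption: 5 rows, 3 fields).
x = np.ones(5, dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

# Reinterpret the raw bytes as float64; nothing is copied or converted.
xx = x.view(np.float64).reshape(x.shape + (-1,))

print(xx.shape)                  # (5, 3)
print(np.shares_memory(x, xx))   # True: the view reuses the same buffer
print(xx[0, 1])                  # 1.0, because 'f1' really is a float
print(xx[0, 0] == 1.0)           # False: the int64 bits of 1 read as a float
```

So the float view is fast because it is a single homogeneous block, but the integer columns come out as meaningless tiny floats.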

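After more testing, I realized that the DataFrame is actually copying the entire data (which does not happen when initializing from the float view of the structured array). I found more information about structured arrays here: https://github.com/pydata/pandas/issues/9216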

Is there a way to speed up the initialization and avoid the copy? I'm open to alternative approaches, but the data comes from a structured array.
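One alternative that may be worth timing is to build the DataFrame one field at a time from a dict of columns. This is only a sketch on a small stand-in array (the 5-row size and three-field dtype are assumptions), and it has not been benchmarked against the 100-million-row case. It keeps each column's dtype, but pandas may still copy every column, so it does not necessarily avoid the copy:

```python
import numpy as np
import pandas as pd

# Small stand-in for the structured array (assumption: 5 rows, 3 fields).
x = np.ones(5, dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

# x[name] is a strided view of a single field; passing a dict of such
# views lets pandas build each column with its own dtype.
df = pd.DataFrame({name: x[name] for name in x.dtype.names})

print(df.dtypes)
```

The integer and float fields end up as separate int64 and float64 columns, which is the mixed-type result I need. Whether it is actually faster than `pd.DataFrame(x)` would have to be measured on the full array.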
