How can I use or manipulate (monkey-patch) pandas so that it always keeps the same major order on the resulting object for copy and groupby aggregations?
I use pandas.DataFrame as the data structure within a business application (risk model) and need fast aggregation of multidimensional data. Aggregation with pandas depends crucially on the major-ordering scheme in use on the underlying numpy array.
Unfortunately, pandas (version 0.23.4) changes the major order of the underlying numpy array when I create a copy or when I perform an aggregation with groupby and sum.
The impact is:
case 1: 17.2 seconds
case 2: 5 min 46 s
on a DataFrame and its copy with 45023 rows and 100000 columns. Aggregation was performed on the index. The index is a pd.MultiIndex with 15 levels. Aggregation keeps three levels and leads to about 239 groups.
I typically work on DataFrames with 45000 rows and 100000 columns. On the rows I have a pandas.MultiIndex with about 15 levels. To compute statistics on various hierarchy nodes I need to aggregate (sum) on the index dimension.
Aggregation is fast if the underlying numpy array is c_contiguous, hence held in row-major order (C order). It is very slow if it is f_contiguous, hence in column-major order (F order).
Unfortunately, pandas changes the major order from C to F when
creating a copy of a DataFrame, and even when
performing an aggregation via a groupby and taking the sum on the grouper. Hence the resulting DataFrame has a different major order (!)
Sure, I could stick to another 'data model', just by keeping the MultiIndex on the columns. Then the current pandas version would always work in my favor. But this is a no-go. I think one can expect that the major order should not be changed by the two operations under consideration (groupby-sum and copy).
import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array.flags
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)
dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)
dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)
aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)
## Output in Jupyter Notebook
# pandas version: 0.23.4
# Numpy array is C-contiguous: True
# DataFrame is C-contiguous: True
# Copy of DataFrame is C-contiguous: False
# Aggregated DataFrame is C-contiguous: False
The major order of the data should be preserved. If pandas prefers to switch to an implicit default, then it should allow overriding it. Numpy allows specifying the order when creating a copy.
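For reference, this is roughly what I mean by the numpy behaviour. It is only a minimal sketch on a toy array; the variable names are made up for illustration and it does not touch pandas at all.

```python
import numpy as np

# Toy array; numpy creates it in C order by default.
c_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# ndarray.copy accepts an explicit `order` argument.
f_copy = c_array.copy(order='F')  # request F order for the copy
c_copy = f_copy.copy(order='C')   # request C order again
k_copy = f_copy.copy(order='K')   # 'K' keeps the layout of the source

print("original C-contiguous:      ", c_array.flags.c_contiguous)
print("copy(order='F') F-contiguous:", f_copy.flags.f_contiguous)
print("copy(order='C') C-contiguous:", c_copy.flags.c_contiguous)
print("copy(order='K') F-contiguous:", k_copy.flags.f_contiguous)
```

The values are identical in all four arrays; only the memory layout differs. Something comparable for DataFrame.copy and for the groupby result is what I am after.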
A patched version of pandas should result in
## Output in Jupyter Notebook
# pandas version: 0.23.4
# Numpy array is C-contiguous: True
# DataFrame is C-contiguous: True
# Copy of DataFrame is C-contiguous: True
# Aggregated DataFrame is C-contiguous: True
for the example code snippet above.