2

I'm trying to transition from Pandas into Xarray for N-Dimensional DataArrays to expand my repertoire.

Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (w/o using panels or multindex :), thank you). I want to convert them to xr.DataArrays so I can build dimensions upon them. I made a mock dataset to give a gist of what I'm talking about.

For this dataset I made up, imagine 100 patients, 12 months, 10000 attributes, 3 replicates (per attribute) which would be a typical 4D dataset. Basically, I'm condensing the 3 replicates per attribute by the mean so I end up with a 2D pd.DataFrame (row=months, col=attributes) this DataFrame is the value in my dictionary and the patient it came from is the key (i.e. (patient_x : DataFrame_X) )

I'm also going to include a round about way I did it with np.ndarray placeholder but it would be really convenient if I could generate a N-dimensional DataArray from a dictionary whose key was patient_x and the value was a DataFrame_X

How can I create a N-Dimensional DataArray using Xarray from a dictionary of Pandas DataFrames?

import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(1618033)

#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates

#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]

coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]

#Dict of DataFrames
D_patient_DF = dict()

for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data

 xr.DataArray(D_patient_DF).dims
#() its empty

D_patient_DF
#{'patient_0':       attr_0    attr_1    attr_2    attr_3    attr_4    attr_5    attr_6  \
# 0   0.445446  0.422018  0.343454  0.140700  0.567435  0.362194  0.563799   
# 1   0.440010  0.548535  0.810903  0.482867  0.469542  0.591939  0.579344   
# 2   0.645719  0.450773  0.386939  0.418496  0.508290  0.431033  0.622270   
# 3   0.555855  0.633393  0.555197  0.556342  0.489865  0.204200  0.823043   
# 4   0.916768  0.590534  0.597989  0.592359  0.484624  0.478347  0.507789   
# 5   0.847069  0.634923  0.591008  0.249107  0.655182  0.394640  0.579700   
# 6   0.700385  0.505331  0.377745  0.651936  0.334216  0.489728  0.282544   
# 7   0.777810  0.423889  0.414316  0.389318  0.565144  0.394320  0.511034   
# 8   0.440633  0.069643  0.675037  0.365963  0.647660  0.520047  0.539253   
# 9   0.333213  0.328315  0.662203  0.594030  0.790758  0.754032  0.602375   
# 10  0.470330  0.419496  0.171292  0.677439  0.683759  0.646363  0.465788   
# 11  0.758556  0.674664  0.801860  0.612087  0.567770  0.801514  0.179939   
4

1 回答 1

5

From a dictionary of DataFrames, you might convert each value into a DataArray (adding dimensions labels), load the results into a Dataset and then convert into a DataArray:

variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)

However, beware that the result will not necessarily be ordered in sorted order, but rather use the arbitrary order of dictionary iteration. If you want sorted order, you should use an OrderedDict instead (insert after setting variables above):

variables = collections.OrderedDict((k, variables[k]) for k in patients)

This outputs:

<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399,  0.26172557,  0.74657302, ...,  0.43742111,
          0.47503291,  0.37263983],
        [ 0.34970732,  0.81527751,  0.53612895, ...,  0.68971198,
          0.68962168,  0.75103198],
        [ 0.71282751,  0.23143891,  0.28481889, ...,  0.52612376,
          0.56992843,  0.3483683 ],
        ...,
        [ 0.84627257,  0.5033482 ,  0.44116194, ...,  0.55020168,
          0.48151353,  0.36374339],
        [ 0.53336826,  0.59566147,  0.45269417, ...,  0.41951078,
          0.46815364,  0.44630235],
        [ 0.25720899,  0.18738289,  0.66639783, ...,  0.36149276,
          0.58865823,  0.33918553]],

       ...,

       [[ 0.42933273,  0.58642504,  0.38716496, ...,  0.45667285,
          0.72684589,  0.52335464],
        [ 0.34946576,  0.35821339,  0.33097093, ...,  0.59037927,
          0.30233665,  0.6515749 ],
        [ 0.63673498,  0.31022272,  0.65788374, ...,  0.47881873,
          0.67825066,  0.58704331],
        ...,
        [ 0.44822441,  0.502429  ,  0.50677081, ...,  0.4843405 ,
          0.84396521,  0.45460029],
        [ 0.61336348,  0.46338301,  0.60715273, ...,  0.48322379,
          0.66530209,  0.52204897],
        [ 0.47520639,  0.43490559,  0.27309414, ...,  0.35280585,
          0.30280485,  0.77537204]]])
Coordinates:
  * month      (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * patient    (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
  * attribute  (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...

Alternatively, you could create a list of 2D DataArrays and then use concat:

patient_list = []
for i, patient in enumerate(patients):
    df = ...
    array = xr.DataArray(df, dims=['patient', 'attribute'])
    patient_list.append(df)
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient')

This would give the same result, and is probably the cleanest code.

于 2016-04-29T23:11:09.050 回答