7

个人(索引从 0 到 5)在两个位置之间进行选择:A 和 B。我的数据具有广泛的格式,其中包含因个人而异的特征 (ind_var) 和仅因位置而异的特征 (location_var)。

例如,我有:

In [281]:

df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

df_reshape_test

Out[281]:
    dist_to_A   dist_to_B   ind_var location location_var
0    0            50             3   A       10
1    0            50             8   A       10
2    0            50            10   A       10
3    50           0              1   B       14
4    50           0              3   B       14
5    50           0              4   B       14

变量“位置”是个人选择的。dist_to_A 是从个人选择的位置到位置 A 的距离(与 dist_to_B 相同)

我希望我的数据具有这种形式:

    choice  dist_S  ind_var location    location_var
0    1        0       3         A           10
0    0       50       3         B           14
1    1        0       8         A           10
1    0       50       8         B           14
2    1        0      10         A           10
2    0       50      10         B           14
3    0       50       1         A           10
3    1        0       1         B           14
4    0       50       3         A           10
4    1        0       3         B           14
5    0       50       4         A           10
5    1        0       4         B           14

其中choice == 1 表示个人选择了该位置,dist_S 是与所选位置的距离。

我阅读了有关.stack方法的信息,但不知道如何将其应用于这种情况。谢谢你的时间!

注意:这只是一个简单的例子。我正在寻找的数据集具有不同数量的位置和每个位置的个人数量,因此如果可能,我正在寻找一个灵活的解决方案

4

3 回答 3

6

其实pandas有一个wide_to_long命令可以方便的做你想做的事。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

#The `location` and `location_var` corresponds to the choices, 
#record them as dictionaries and drop them 
#(Just realized you had a cleaner way, copied from yous). 

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location') 

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print df_long.sort()

这为您提供了您要求的表格:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

当然,您可以生成一个选择变量,0如果1那是您想要的。

于 2014-03-28T00:02:37.067 回答
3

我有点好奇你为什么喜欢这种格式。可能有更好的方法来存储您的数据。但是这里。

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B
', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 
0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df) / 2 + 1 if you have more rows.d 

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:].location.replace({'A': 'B', 'B': 'A'}, inplace=True)

In [149]: df2 = df2.sort()

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

它不会很好地概括,但可能有替代(更好)的方法来绕过更丑陋的部分,比如生成选择 col。

于 2013-07-16T23:18:01.367 回答
2

好的,这比我预期的要花更长的时间,但这是一个更通用的答案,适用于每个人的任意数量的选择。我确信有更简单的方法,所以如果有人可以为以下一些代码提供更好的方法,那就太好了。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

这使

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

然后我们这样做:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

现在我创建一个多索引并重新索引以获得一个长形状

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

长的形状是这样的:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1

所以现在用字典填充 NaN 值:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

最后,剩下的就是创建 dist_S
我会在这里作弊并假设我可以创建一个像这样的嵌套字典

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(内容为:如果您在位置 A,则位置 A 位于 0 公里处,位置 B 位于 50 公里处)

def nested_f(x):    
    return nested_loc[x[0]][x[1]]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

给出想要的结果

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0
于 2013-07-19T20:57:51.117 回答