Using pandas 0.11 on Python 2.7.3, I am trying to pivot a simple dataframe with the following values:

    StudentID QuestionID Answer DateRecorded
0        1234        bar      a   2012/01/21
1        1234        foo      c   2012/01/22
2        4321        bop      a   2012/01/22
3        5678        bar      a   2012/01/24
4        8765        baz      b   2012/02/13
5        4321        baz      b   2012/02/15
6        8765        bop      b   2012/02/16
7        5678        bop      c   2012/03/15
8        5678        foo      a   2012/04/01
9        1234        baz      b   2012/04/11
10       8765        bar      a   2012/05/03
11       4321        bar      a   2012/05/04
12       5678        baz      c   2012/06/01
13       1234        bar      b   2012/11/01

I am using the following command:

 df.pivot(index='StudentID', columns='QuestionID')

But I get the following error:

ReshapeError: Index contains duplicate entries, cannot reshape

Note that the same dataframe without the last row

13       1234        bar      b   2012/11/01

pivots successfully, giving:

           Answer               DateRecorded                                    
QuestionID    bar baz  bop  foo          bar         baz         bop         foo
StudentID                                                                       
1234            a   b  NaN    c   2012/01/21  2012/04/11         NaN  2012/01/22
4321            a   b    a  NaN   2012/05/04  2012/02/15  2012/01/22         NaN
5678            a   c    c    a   2012/01/24  2012/06/01  2012/03/15  2012/04/01
8765            a   b    b  NaN   2012/05/03  2012/02/13  2012/02/16         NaN

I am new to pivoting and would like to know why a duplicate (StudentID, QuestionID) pair causes this problem. And how can I fix it using the df.pivot() function?

Thank you.

2 Answers

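What do you want your pivot table to look like with the duplicate entries? I'm not sure it makes sense to have multiple elements for (1234, bar) in a pivot table. Your data looks like it is naturally indexed by (QuestionID, StudentID, DateRecorded).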
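
To see which rows collide, one option is to list every row whose (StudentID, QuestionID) pair occurs more than once. This is a minimal sketch that assumes a reasonably recent pandas (the `keep` argument of `duplicated` is not available in 0.11) and uses a four-row excerpt of the question's data:

```python
import pandas as pd

# Excerpt of the question's data, including the row that breaks the pivot.
df = pd.DataFrame({
    "StudentID": [1234, 1234, 4321, 1234],
    "QuestionID": ["bar", "foo", "bop", "bar"],
    "Answer": ["a", "c", "a", "b"],
    "DateRecorded": ["2012/01/21", "2012/01/22", "2012/01/22", "2012/11/01"],
})

# keep=False flags every row of a repeated (StudentID, QuestionID) pair,
# not only the second and later occurrences.
dupes = df[df.duplicated(subset=["StudentID", "QuestionID"], keep=False)]
print(dupes)
```

Both (1234, bar) rows show up here; each would need the same cell of the pivoted table, which is why the reshape fails.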

If you go with the hierarchical indexing approach (it really isn't that complicated!), I would try:

In [104]: df2 = df.set_index(['StudentID', 'QuestionID', 'DateRecorded'])

In [105]: df2
Out[105]: 
                                  Answer
StudentID QuestionID DateRecorded       
1234      bar        2012/01/21        a
          foo        2012/01/22        c
4321      bop        2012/01/22        a
5678      bar        2012/01/24        a
8765      baz        2012/02/13        b
4321      baz        2012/02/15        b
8765      bop        2012/02/16        b
5678      bop        2012/03/15        c
          foo        2012/04/01        a
1234      baz        2012/04/11        b
8765      bar        2012/05/03        a
4321      bar        2012/05/04        a
5678      baz        2012/06/01        c
1234      bar        2012/11/01        b

In [106]: df2.unstack('QuestionID')
Out[106]: 
                       Answer               
QuestionID                bar  baz  bop  foo
StudentID DateRecorded                      
1234      2012/01/21        a  NaN  NaN  NaN
          2012/01/22      NaN  NaN  NaN    c
          2012/04/11      NaN    b  NaN  NaN
          2012/11/01        b  NaN  NaN  NaN
4321      2012/01/22      NaN  NaN    a  NaN
          2012/02/15      NaN    b  NaN  NaN
          2012/05/04        a  NaN  NaN  NaN
5678      2012/01/24        a  NaN  NaN  NaN
          2012/03/15      NaN  NaN    c  NaN
          2012/04/01      NaN  NaN  NaN    a
          2012/06/01      NaN    c  NaN  NaN
8765      2012/02/13      NaN    b  NaN  NaN
          2012/02/16      NaN  NaN    b  NaN
          2012/05/03        a  NaN  NaN  NaN

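Otherwise, you can come up with some rule for deciding which of the multiple entries to use in the pivot table, and avoid the hierarchical index.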
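
As one possible rule, the sketch below keeps only the most recent answer per (StudentID, QuestionID) pair before pivoting. The choice of "most recent wins" is an assumption made for illustration, not something the question specifies, and the code assumes a recent pandas (`sort_values` and `keep="last"` are not in 0.11). The data is an excerpt of the question's rows:

```python
import pandas as pd

df = pd.DataFrame({
    "StudentID": [1234, 1234, 4321, 1234, 4321, 1234],
    "QuestionID": ["bar", "foo", "bop", "baz", "bar", "bar"],
    "Answer": ["a", "c", "a", "b", "a", "b"],
    "DateRecorded": ["2012/01/21", "2012/01/22", "2012/01/22",
                     "2012/04/11", "2012/05/04", "2012/11/01"],
})

# YYYY/MM/DD strings sort chronologically, so after sorting the last row
# of each (StudentID, QuestionID) pair is the most recent one.
latest = (df.sort_values("DateRecorded")
            .drop_duplicates(subset=["StudentID", "QuestionID"], keep="last"))

# With duplicates removed, the original pivot call no longer raises.
wide = latest.pivot(index="StudentID", columns="QuestionID")
print(wide)
```

Passing `keep="first"` instead would retain the earliest answer.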

answered 2013-07-31T20:16:13.240

Instead of relying on pandas (which is of course better), you can also aggregate the data manually.

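from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def heatmap_seaborn():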
    na_lr_measures = [50, 50, 50, 49, 49, 49, 48, 47, 47, 47, 46, 46, 46, 46, 45, 45, 45, 45, 45, 45, 45, 45, 45, 43, 43, 43, 43, 42, 42, 42, 41, 41, 41, 41, 41, 41, 41, 40, 40, 40, 40, 40, 40, 40, 40, 39, 39, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 34, 34, 34, 33, 33, 33, 32, 32, 31, 30, 30, 30, 29, 29]
    na_lr_labels = ('bi2e', 'bi21', 'bi22', 'si21', 'si22', 'si2e', 'si11', 'bi11', 'bi1e', 'si1e', 'bx21', 'ti22', 'bx2e', 'si12', 'ti1e', 'sx22', 'ti21', 'bx22', 'sx2e', 'bi12', 'ti11', 'sx21', 'ti2e', 'ti12', 'sx11', 'sx1e', 'bxx2', 'bx1e', 'bx11', 'tx2e', 'tx22', 'tx21', 'sx12', 'six1', 'six2', 'sixe', 'sixx', 'tx11', 'bx12', 'bix2', 'bix1', 'tx1e', 'bixe', 'bixx', 'bxxe', 'sxx2', 'tx12', 'tixe', 'tix1', 'sxxe', 'sxx1', 'si1x', 'tixx', 'bxx1', 'tix2', 'bi2x', 'sxxx', 'si2x', 'txx1', 'bxxx', 'txxe', 'ti2x', 'sx2x', 'bx2x', 'txxx', 'bi1x', 'tx1x', 'sx1x', 'tx2x', 'txx2', 'bx1x', 'ti1x')
    na_lr_labelcategories = ["TF", "IDF", "Normalisation", "Regularisation", "Acc@161"]


    measures = na_lr_measures
    labels = na_lr_labels
    cats = na_lr_labelcategories


    # Group the measures under a reduced two-character label:
    # TF (character 0) plus Normalisation (character 2).
    new_measures = defaultdict(list)
    for label, measure in zip(labels, measures):
        key = label[0] + label[2]
        new_measures[key].append(measure)
    # Average the measures within each group.
    labels = list(new_measures)
    measures = [np.mean(new_measures[l]) for l in labels]

    df = pd.DataFrame(
                  {cats[0]:pd.Categorical([a[0] for a in labels]), 
                   #cats[1]:pd.Categorical([a[1] for a in labels]), 
                   cats[2]:pd.Categorical([a[1] for a in labels]), 
                   #cats[3]:pd.Categorical([a[3] for a in labels]), 
                   cats[4]:measures})
    print(df)

    # Keyword arguments work across pandas versions.
    df = df.pivot(index=cats[0], columns=cats[2], values=cats[4])
    sns.set_context("paper",font_scale=2.7)
    fig, ax = plt.subplots()
    ax = sns.heatmap(df)
    plt.show()

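As you can see in the example, the table is aggregated manually (the measures are averaged per label) and the pandas dataframe is then built from the resulting arrays. I did it this way because I didn't have time to learn more pandas.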
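
For comparison, the same averaging can probably be left to pandas by using `pivot_table`, which aggregates duplicate index/column pairs instead of raising. This is a sketch using only the first eight labels and measures from the code above; the column names follow `na_lr_labelcategories`:

```python
import pandas as pd

# First eight labels and measures from the example above.
labels = ["bi2e", "bi21", "bi22", "si21", "si22", "si2e", "si11", "bi11"]
measures = [50, 50, 50, 49, 49, 49, 48, 47]

df = pd.DataFrame({
    "TF": [l[0] for l in labels],             # character 0 of each label
    "Normalisation": [l[2] for l in labels],  # character 2 of each label
    "Acc@161": measures,
})

# Several rows share a (TF, Normalisation) pair; aggfunc="mean" averages them.
table = df.pivot_table(index="TF", columns="Normalisation",
                       values="Acc@161", aggfunc="mean")
print(table)
```

The resulting `table` can be passed to `sns.heatmap` in the same way as the manually aggregated frame.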

answered 2015-10-12T05:26:31.747