python - Summary calculations on a Pandas DataFrame

Question

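I have a DF that looks like the one at the bottom (an excerpt; there are 4 regions, and the dates extend every quarter).

I want to create a df (by region) containing only the differences between the most recent date and the previous quarter, and between the most recent date and the same quarter of the previous year.

At this point both region and Quradate are indexes.

So I want something like this (not very close):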

(['region'] ['Quradate'][-1:-1])-(['region'] ['Quradate'][-2:-2]) 
& (['region']  ['Quradate'][-1:-1])-(['region'] ['Quradate'][-5:-5])

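So I would end up with two rows per region: the first with the score differences from the previous quarter (there are actually 5 scores), and the second with the differences from the previous year.

Stuck...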

                                                                  Score1      Score2  
region                                           Quradate           
North_Central-Birmingham-Tuscaloosa-Anniston 2010-01-15             47           50
                                             2010-04-15             45           60
                                             2010-07-15             45           40
                                             2010-10-15             42           43
                                             2011-01-15             46           44
                                             2011-04-15             45           45
                                             2011-07-15             45           45
                                             2011-10-15             43           46
                                             2012-01-15             51           55
                                             2012-04-15             53           56
                                             2012-07-15             51           57
                                             2012-10-15             52           58
                                             2013-01-15             50           50
                                             2013-04-15             55           55
                                             2013-07-15             55           56
                                             2013-10-15             51           66   
North_Huntsville-Decatur-Florence            2010-01-15             55           55

score 1 · Accepted Answer

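See here for the solution and discussion: "Using index names to select a new DataFrame from a MultiIndex frame in Pandas"

Basically all you need is the difference from the previous period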

df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])

and the difference from one year earlier (4 quarters)

df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])
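To get the two rows per region that the question asks for, the two results can be stacked into one frame. The following is only a sketch under assumptions: the sample data, the short region names `North` and `South`, and the labels `quarter_diff` and `year_diff` are made up for illustration, and the data are assumed to be complete quarterly series sorted by date.

```python
import pandas as pd

# Hypothetical sample data: two regions, six consecutive quarters each.
dates = pd.to_datetime(['2012-07-15', '2012-10-15', '2013-01-15',
                        '2013-04-15', '2013-07-15', '2013-10-15'])
index = pd.MultiIndex.from_product([['North', 'South'], dates],
                                   names=['region', 'Quradate'])
df = pd.DataFrame({'Score1': [51, 52, 50, 55, 55, 51, 40, 41, 43, 42, 45, 47],
                   'Score2': [57, 58, 50, 55, 56, 66, 30, 31, 32, 33, 34, 35]},
                  index=index).sort_index()

grouped = df.groupby(level='region')

# diff() runs within each region and keeps the original index;
# tail(1) then keeps only the newest row of every region.
quarter_diff = grouped.diff(1).groupby(level='region').tail(1)
year_diff = grouped.diff(4).groupby(level='region').tail(1)

# Stack both results and label each row by the kind of difference.
summary = pd.concat(
    [quarter_diff.droplevel('Quradate'), year_diff.droplevel('Quradate')],
    keys=['quarter_diff', 'year_diff'],
    names=['id', 'region'],
).swaplevel().sort_index()

print(summary)
```

The result is indexed by `(region, id)`, with one column per score. Note that `diff(4)` counts rows, not calendar time, so a missing quarter would silently shift the comparison.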

score 0 · Accepted Answer

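I think you are on the right track to some extent. I would create a function that computes the two values you are looking for and returns a DataFrame. Something like the following: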

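import pandas as pd

def find_diffs(region):
    # Assumes Quradate is a datetime column; call reset_index() first if it is in the index
    score_cols = ['Score1', 'Score2']
    region = region.sort_values('Quradate')

    most_recent_date = region.Quradate.max()
    last_quarter = most_recent_date - pd.DateOffset(months=3)  # shift back one quarter (3 months)
    last_year = most_recent_date - pd.DateOffset(years=1)  # shift back one year

    # Keep only the two dates being compared; the last row of diff() is newest minus older
    quarter_rows = region[region.Quradate.isin([most_recent_date, last_quarter])]
    quarter_score_diff = quarter_rows[score_cols].diff().iloc[[-1]].assign(id='quarter_diff')

    year_rows = region[region.Quradate.isin([most_recent_date, last_year])]
    year_score_diff = year_rows[score_cols].diff().iloc[[-1]].assign(id='year_diff')

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return pd.concat([quarter_score_diff, year_score_diff])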

Then you can do the following (after DF.reset_index() if region and Quradate are still in the index):

DF.groupby(['region']).apply(find_diffs)

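The result will be a DF indexed by region, with a column for each score difference and an additional column identifying each row as a quarterly or yearly difference.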
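One detail matters when rows are matched on exact dates, as find_diffs does: quarterly dates are not a fixed number of days apart, so subtracting a plain `timedelta` may not land on a date that exists in the data. The small check below illustrates this with one date from the question; it is a sketch, and using `pd.DateOffset` here is a suggestion rather than part of the original answer.

```python
import datetime
import pandas as pd

most_recent = pd.Timestamp('2013-10-15')

# A fixed span of 365/4 days (91 days and 6 hours) overshoots the quarterly date.
approx_quarter = most_recent - datetime.timedelta(365 / 4)

# Calendar-aware offsets keep the day of the month.
exact_quarter = most_recent - pd.DateOffset(months=3)
exact_year = most_recent - pd.DateOffset(years=1)

print(approx_quarter)  # lands on 2013-07-15 18:00, which matches no row
print(exact_quarter)   # 2013-07-15
print(exact_year)      # 2012-10-15
```

A span of exactly 365 days has the same problem whenever a leap day falls inside the year being spanned.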

score 0 · Accepted Answer

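Writing a function and using it with groupby is definitely one option. Another easy approach is to pull each group's data into lists and compute with positional indexes, which is possible because your data are regularly spaced (keep in mind that this only works when the data are at regular intervals). This approach avoids dealing with dates at all. First I would reset the index so that the region appears as a column in the DataFrame, and then I would do the following: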

# First I create some data
import numpy as np
import pandas as pd

# Quarter-end dates; newer pandas versions prefer freq='QE' over 'Q'
Dates = pd.date_range('2010-1-1', periods=14, freq='Q')
Regions = ['Western', 'Eastern', 'Southern', 'Northern']
df = pd.DataFrame({'Regions': [elem for elem in Regions for x in range(14)],
                   'Score1': np.random.rand(56), 'Score2': np.random.rand(56), 'Score3': np.random.rand(56),
                   'Score4': np.random.rand(56), 'Score5': np.random.rand(56)}, index=list(Dates) * 4)

# Create a dictionary to hold your data
SCORES = ['Score1', 'Score2', 'Score3', 'Score4', 'Score5']
ValuesDict = {region : {score : [int(), int()] for score in SCORES} for region in df.Regions.unique()}

# This dictionary's keys are your regions. Each region maps to a dictionary keyed by score, and each score maps to a list whose first element is the (most recent - last quarter) calculation and whose second element is the (most recent - last year) calculation.

#Now group the data
dfGrouped = df.groupby('Regions')

# Now iterate through the groups, creating lists of the underlying data. The value at the last index of each list is by definition the newest (the dates are in order within each group), and the observation one year earlier is 4 index points before it.

for group in dfGrouped:
    Score1List = list(group[1].Score1)
    Score2List = list(group[1].Score2)
    Score3List = list(group[1].Score3)
    Score4List = list(group[1].Score4)
    Score5List = list(group[1].Score5)
    MasterList = [Score1List, Score2List, Score3List, Score4List, Score5List]
    for x in range(1, 6):
        ValuesDict[group[0]]['Score' + str(x)][0] = MasterList[x-1][-1] - MasterList[x-1][-2]
        ValuesDict[group[0]]['Score' + str(x)][1] = MasterList[x-1][-1] - MasterList[x-1][-5]

ValuesDict

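It is a bit convoluted, but this is how I usually handle this kind of problem. ValuesDict contains all the data you need, but I have had trouble putting it into a DataFrame.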
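One possible way to get such a nested dictionary into a DataFrame is to flatten it into one record per (region, kind of difference) pair. This is a sketch only: `values_dict` below is a small hypothetical stand-in with the same shape as ValuesDict, and the labels `quarter_diff` and `year_diff` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for ValuesDict: region -> score -> [quarter diff, year diff]
values_dict = {
    'Western': {'Score1': [0.1, -0.2], 'Score2': [0.3, 0.4]},
    'Eastern': {'Score1': [-0.5, 0.6], 'Score2': [0.7, -0.8]},
}
labels = ['quarter_diff', 'year_diff']

keys = []     # one (region, label) tuple per output row
records = []  # one {score: value} dict per output row
for region, scores in values_dict.items():
    for position, label in enumerate(labels):
        keys.append((region, label))
        records.append({score: diffs[position] for score, diffs in scores.items()})

row_index = pd.MultiIndex.from_tuples(keys, names=['region', 'id'])
result = pd.DataFrame(records, index=row_index).sort_index()

print(result)
```

Each region ends up with two rows, one per kind of difference, and one column per score.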
