
I have the following dataframe:

import pandas as pd

data = {
    "date": {
        0: "2019-02-01",
        2: "2019-02-07",
        3: "2019-02-15",
        5: "2019-02-18",
        12: "2019-03-02",
        17: "2019-03-06",
        19: "2019-03-13",
        21: "2019-03-20",
    },
    "date_month_start": {
        0: "2019-02-01",
        2: "2019-02-01",
        3: "2019-02-01",
        5: "2019-02-01",
        12: "2019-03-01",
        17: "2019-03-01",
        19: "2019-03-01",
        21: "2019-03-01",
    },
    "account": {0: 67, 2: 69, 3: 67, 5: 67, 12: 67, 17: 67, 19: 67, 21: 69,},
    "balance": {
        0: 1705.65,
        2: 1929.49,
        3: 2004.46,
        5: 2595.54,
        12: 4428.41,
        17: 2301.5,
        19: 3089.82,
        21: 3141.19,
    },
    "amount": {0: 0, 2: 0, 3: 0, 5: 0, 12: 0, 17: 0, 19: 0, 21: 0},
    "category__name": {
        0: "aaa",
        2: "aaa",
        3: "bbb",
        5: "aaa",
        12: "aaa",
        17: "bbb",
        19: "aaa",
        21: "aaa",
    },
}
df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df["date_month_start"] = pd.to_datetime(df["date_month_start"])
df.sort_values('date', inplace=True)

Which results in:

      date           date_month_start   account   balance   amount   category__name
0     2019-02-01     2019-02-01         67        1705.65   0        aaa
2     2019-02-07     2019-02-01         69        1929.49   0        aaa
3     2019-02-15     2019-02-01         67        2004.46   0        bbb
5     2019-02-18     2019-02-01         67        2595.54   0        aaa
12    2019-03-02     2019-03-01         67        4428.41   0        aaa
17    2019-03-06     2019-03-01         67        2301.50   0        bbb
19    2019-03-13     2019-03-01         67        3089.82   0        aaa
21    2019-03-20     2019-03-01         69        3141.19   0        aaa

I need to determine the first date_month_start for each combination of account and category__name. Then, among the rows of each group that fall in that first month, I need to set the amount of the last row to that row's balance.

The result will be:

      date           date_month_start   account   balance   amount   category__name
0     2019-02-01     2019-02-01         67        1705.65   0        aaa
2     2019-02-07     2019-02-01         69        1929.49   1929.49  aaa
3     2019-02-15     2019-02-01         67        2004.46   2004.46  bbb
5     2019-02-18     2019-02-01         67        2595.54   2595.54  aaa
12    2019-03-02     2019-03-01         67        4428.41   0        aaa
17    2019-03-06     2019-03-01         67        2301.50   0        bbb
19    2019-03-13     2019-03-01         67        3089.82   0        aaa
21    2019-03-20     2019-03-01         69        3141.19   0        aaa

In other words:

  • The first date_month_start for account = 69 and category__name = aaa is 2019-02-01. Set the amount of the last row of that group in that month to its balance, i.e. 1929.49
  • The first date_month_start for account = 67 and category__name = bbb is 2019-02-01. Set the amount of the last row of that group in that month to its balance, i.e. 2004.46
  • The first date_month_start for account = 67 and category__name = aaa is 2019-02-01. Set the amount of the last row of that group in that month to its balance, i.e. 2595.54

In this example the earliest date_month_start is the same for every group, but that is not always so.


2 Answers


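Create two boolean masks and combine them with & (bitwise AND). The first mask compares each row's date_month_start with the first date_month_start of its (account, category__name) group, so it is True for every row in the group's first month. The second mask uses DataFrame.duplicated with keep='last' on date_month_start, account and category__name, which flags every row except the last one of each combination; negating it selects the last rows: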

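# True where a row's date_month_start equals the first one of its (account, category__name) group
mask1 = (df.groupby(['account','category__name'])['date_month_start'].transform('first')
          .eq(df['date_month_start']))
# True for every row except the last one of each (date_month_start, account, category__name) combination
mask2 = df.duplicated(['date_month_start','account','category__name'], keep='last')

# Copy balance into amount only for the last row of each group's first month
df['amount'] = df['amount'].mask(mask1 & ~mask2, df['balance'])
print(df)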
         date date_month_start  account  balance   amount category__name
0  2019-02-01       2019-02-01       67  1705.65     0.00            aaa
2  2019-02-07       2019-02-01       69  1929.49  1929.49            aaa
3  2019-02-15       2019-02-01       67  2004.46  2004.46            bbb
5  2019-02-18       2019-02-01       67  2595.54  2595.54            aaa
12 2019-03-02       2019-03-01       67  4428.41     0.00            aaa
17 2019-03-06       2019-03-01       67  2301.50     0.00            bbb
19 2019-03-13       2019-03-01       67  3089.82     0.00            aaa
21 2019-03-20       2019-03-01       69  3141.19     0.00            aaa
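
To check the case the question mentions, where groups do not all start in the same month, here is a minimal sketch on made-up data. The frame `demo` and its values are assumptions for illustration only; account 69 deliberately first appears in March. The masks are the same as above, just with descriptive names.

```python
import pandas as pd

# Hypothetical data: account 67 starts in February, account 69 only in March.
demo = pd.DataFrame(
    {
        "date": ["2019-02-01", "2019-02-18", "2019-03-02", "2019-03-05", "2019-03-20"],
        "date_month_start": ["2019-02-01", "2019-02-01", "2019-03-01", "2019-03-01", "2019-03-01"],
        "account": [67, 67, 67, 69, 69],
        "balance": [10.5, 20.5, 30.5, 40.5, 50.5],
        "amount": [0.0, 0.0, 0.0, 0.0, 0.0],
        "category__name": ["aaa", "aaa", "aaa", "aaa", "aaa"],
    }
)
demo["date"] = pd.to_datetime(demo["date"])
demo["date_month_start"] = pd.to_datetime(demo["date_month_start"])
demo = demo.sort_values("date")  # 'first' below relies on this ordering

keys = ["account", "category__name"]

# Rows that belong to the first month of their group.
in_first_month = (
    demo.groupby(keys)["date_month_start"]
    .transform("first")
    .eq(demo["date_month_start"])
)
# Rows that are NOT the last one of their (month, account, category) combination.
not_last = demo.duplicated(["date_month_start"] + keys, keep="last")

demo["amount"] = demo["amount"].mask(in_first_month & ~not_last, demo["balance"])
print(demo[["date", "account", "balance", "amount"]])
```

Only the 2019-02-18 row of account 67 and the 2019-03-20 row of account 69 receive their balance; the March row of account 67 stays at 0 because March is not that group's first month.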
Answered on 2020-07-20T08:51:08.517

This is a quick and dirty implementation, just to get the logic right. It may help someone wrap their mind around this kind of problem. @jezrael's version is the way to go in terms of speed.

# amount starts out as an integer column; cast it to float so the balance is not truncated
df['amount'] = df['amount'].astype(float)

for i, g in df.groupby(['account', 'category__name']):
    date = g.iloc[0]['date_month_start']                  # first month of the group (df is sorted by date)
    same_start_date = g.loc[g.date_month_start == date]   # rows of the group in that month
    last_balance = same_start_date.balance.values[-1]
    #print(i, date, last_balance)
    df.at[same_start_date.index[-1], 'amount'] = last_balance

#(67, 'aaa') 2019-02-01 00:00:00 2595.54
#(67, 'bbb') 2019-02-01 00:00:00 2004.46
#(69, 'aaa') 2019-02-01 00:00:00 1929.49

df
         date date_month_start  account  balance   amount category__name
0  2019-02-01       2019-02-01       67  1705.65     0.00            aaa
2  2019-02-07       2019-02-01       69  1929.49  1929.49            aaa
3  2019-02-15       2019-02-01       67  2004.46  2004.46            bbb
5  2019-02-18       2019-02-01       67  2595.54  2595.54            aaa
12 2019-03-02       2019-03-01       67  4428.41     0.00            aaa
17 2019-03-06       2019-03-01       67  2301.50     0.00            bbb
19 2019-03-13       2019-03-01       67  3089.82     0.00            aaa
21 2019-03-20       2019-03-01       69  3141.19     0.00            aaa
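
For readers who want the same step-by-step logic without the Python loop, the following is one possible sketch and not part of either original answer. It rebuilds the question's data in compact form, finds each group's first month with transform('min'), and picks the last row of that month with groupby(...).tail(1). The names `keys`, `first_month` and `last_idx` are made up for illustration.

```python
import pandas as pd

# The question's data, written compactly; amount is float from the start.
df = pd.DataFrame(
    {
        "date": ["2019-02-01", "2019-02-07", "2019-02-15", "2019-02-18",
                 "2019-03-02", "2019-03-06", "2019-03-13", "2019-03-20"],
        "date_month_start": ["2019-02-01"] * 4 + ["2019-03-01"] * 4,
        "account": [67, 69, 67, 67, 67, 67, 67, 69],
        "balance": [1705.65, 1929.49, 2004.46, 2595.54,
                    4428.41, 2301.5, 3089.82, 3141.19],
        "amount": [0.0] * 8,
        "category__name": ["aaa", "aaa", "bbb", "aaa", "aaa", "bbb", "aaa", "aaa"],
    },
    index=[0, 2, 3, 5, 12, 17, 19, 21],
)
df["date"] = pd.to_datetime(df["date"])
df["date_month_start"] = pd.to_datetime(df["date_month_start"])
df = df.sort_values("date")  # tail(1) below means "latest row" only if sorted

keys = ["account", "category__name"]

# Earliest month of each group, broadcast back to every row.
first_month = df.groupby(keys)["date_month_start"].transform("min")

# Keep only first-month rows, then take the last one per group.
last_idx = df[df["date_month_start"].eq(first_month)].groupby(keys).tail(1).index

df.loc[last_idx, "amount"] = df.loc[last_idx, "balance"]
print(df)
```

Using 'min' instead of 'first' means the first month does not depend on row order, although the choice of the last row still assumes the frame is sorted by date.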
Answered on 2020-07-20T09:04:11.380