As a first step, read your data into a DataFrame with the following code:
import pandas as pd
df = pd.read_csv("your_df.csv")
My sample DataFrame looks like this:
Pub.Dates Type Visits
0 2019-12-10 00:00:00 A 1000
1 2019-12-15 00:00:00 A 5000
2 2018-06-10 00:00:00 B 6000
3 2018-03-04 00:00:00 B 12000
4 2019-02-10 00:00:00 A 3000
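If you want to follow along without a CSV file, the sample can also be built in memory. This is only a sketch: the rows are copied from the table above, and the dates are deliberately left as strings, on the assumption that `read_csv` was called without `parse_dates`.

```python
import pandas as pd

# Rows copied from the sample table; dates stay as plain strings,
# which is what read_csv yields when parse_dates is not used.
df = pd.DataFrame(
    {
        "Pub.Dates": [
            "2019-12-10 00:00:00",
            "2019-12-15 00:00:00",
            "2018-06-10 00:00:00",
            "2018-03-04 00:00:00",
            "2019-02-10 00:00:00",
        ],
        "Type": ["A", "A", "B", "B", "A"],
        "Visits": [1000, 5000, 6000, 12000, 3000],
    }
)
print(df)
```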
Normalize the dates: first define a function that normalizes a single date:
from datetime import datetime
def normalize_date(date):  # input: '2019-12-10 00:00:00'
    date_obj = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")  # get datetime object
    date_to_str = date_obj.strftime("%B %Y")  # 'December 2019'
    diff_date = datetime.now() - date_obj  # find diff from today
    diff_month = int(diff_date.days / 30)  # convert days to months (30 days each)
    normalized_value = date_to_str + ", " + str(diff_month) + " months"
    return normalized_value  # 'December 2019, 9 months'
Now apply this function to every value in the date column:
df["Pub.Dates"] = df["Pub.Dates"].map(normalize_date)
The normalized DataFrame will look like this (the month counts depend on the day you run the code):
Pub.Dates Type Visits
0 December 2019, 9 months A 1000
1 December 2019, 9 months A 5000
2 June 2018, 27 months B 6000
3 March 2018, 31 months B 12000
4 February 2019, 19 months A 3000
5 July 2020, 2 months C 9000
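Because `normalize_date` calls `datetime.now()`, its output changes from day to day. A possible variant, shown here only as a sketch, does the same thing on the whole column at once and takes the reference date as an argument. The name `normalize_dates` and the `today` parameter are made up for this example, and 2020-09-20 is an assumed reference date, picked because it gives the same month counts as the table above for the sample rows shown here.

```python
import pandas as pd

def normalize_dates(dates: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Column-wise version of normalize_date with an explicit reference date."""
    parsed = pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S")
    # Same 30-day approximation of a month as in normalize_date.
    months = (today - parsed).dt.days // 30
    return parsed.dt.strftime("%B %Y") + ", " + months.astype(str) + " months"

sample = pd.Series(["2019-12-10 00:00:00", "2018-03-04 00:00:00"])
result = normalize_dates(sample, today=pd.Timestamp("2020-09-20"))
print(result.tolist())
# ['December 2019, 9 months', 'March 2018, 31 months']
```

Passing the reference date in makes the function easy to test, and `pd.Timestamp.now()` can still be supplied when the current date is what you want.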
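For the second step, if there are several records per month, group by the date and any other columns you need (pass them as a list, not a tuple), then take the mean:
average_in_visits = df.groupby(["Pub.Dates", "Type"]).mean()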
The result will be:
Visits
Pub.Dates Type
December 2019, 9 months A 3000
February 2019, 19 months A 3000
July 2020, 2 months C 9000
June 2018, 27 months B 6000
March 2018, 31 months B 12000
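Note that grouping on the formatted strings sorts the months alphabetically (December, February, July, ...). If chronological order matters, one option is to group on a monthly period taken from the original datetime values instead. The sketch below assumes you still have the unnormalized dates; the `Month` label is just a name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Pub.Dates": pd.to_datetime(
            ["2019-12-10", "2019-12-15", "2018-06-10", "2018-03-04", "2019-02-10"]
        ),
        "Type": ["A", "A", "B", "B", "A"],
        "Visits": [1000, 5000, 6000, 12000, 3000],
    }
)

# Group on a monthly Period rather than the display string, so the
# result is ordered by time instead of by month name.
month = df["Pub.Dates"].dt.to_period("M").rename("Month")
average_in_visits = df.groupby([month, "Type"])["Visits"].mean()
print(average_in_visits)
```

The display string ("December 2019, 9 months") can still be added afterwards for presentation, once the aggregation is done.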