Using GroupBy.cumcount
Cumulative count of the distinct hospitals each client has visited
import pandas as pd
df = pd.DataFrame({
'record_id': list(range(1,7)),
'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})
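# Order each client's visits chronologically
df.sort_values(by=['client_id', 'date'], inplace=True)
# Keep only the first visit to each (client, hospital) pair and number those per client
df['hospital_count'] = df.drop_duplicates(subset=['client_id', 'hospital']
).groupby('client_id').cumcount() + 1
# Repeat visits were dropped above and are NaN here; carry the last count forward
# (ffill replaces fillna(method='ffill'), which is deprecated in pandas 2.x)
df['hospital_count'] = df['hospital_count'].ffill()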
print(df)
# record_id client_id date hospital hospital_count
# 1 2 JJ 20160401 2j 1.0
# 3 4 JJ 20160501 2h 2.0
# 5 6 JJ 20160606 2j 2.0
# 0 1 MK 20140101 1j 1.0
# 2 3 MK 20140226 1j 1.0
# 4 5 MK 20140301 2h 2.0
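Explanation: we use drop_duplicates to remove repeat visits by the same client to the same hospital; then we can simply use groupby and cumcount to number the remaining visits for each client, which gives the count of distinct hospitals seen so far. However, this leaves NaN values in the rows that were dropped; we fill these by carrying the previous count forward (fillna with method='ffill', or equivalently ffill).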
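The same running count can also be obtained without creating and then filling NaN values. The following is a minimal sketch of one alternative, assuming the same example data; the name first_visit is introduced here for illustration only. It marks the first visit of each client to each hospital with duplicated, then takes a per-client running total of those marks.

```python
import pandas as pd

df = pd.DataFrame({
    'record_id': list(range(1, 7)),
    'client_id': ['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
    'date': [20140101, 20160401, 20140226, 20160501, 20140301, 20160606],
    'hospital': ['1j', '2j', '1j', '2h', '2h', '2j'],
})
df = df.sort_values(by=['client_id', 'date'])

# True on a client's first visit to a given hospital, False on repeat visits
# (first_visit is an illustrative name, not from the original answer)
first_visit = ~df.duplicated(subset=['client_id', 'hospital'])

# Running total of first visits per client = distinct hospitals seen so far
df['hospital_count'] = first_visit.astype(int).groupby(df['client_id']).cumsum()
print(df)
```

Because no rows are dropped, the column stays an integer column rather than being converted to float by the NaN values.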
Cumulative number of visits by each client to each hospital
import pandas as pd
df = pd.DataFrame({
'record_id': list(range(1,7)),
'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})
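# Sort so visits are numbered chronologically within each (client, hospital) pair;
# the result is assigned back by index, so df keeps its original row order
df['hospital_count'] = df.sort_values(by=['client_id', 'hospital', 'date']
).groupby(['client_id', 'hospital']
).cumcount() + 1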
print(df)
# record_id client_id date hospital hospital_count
# 0 1 MK 20140101 1j 1
# 1 2 JJ 20160401 2j 1
# 2 3 MK 20140226 1j 2
# 3 4 JJ 20160501 2h 1
# 4 5 MK 20140301 2h 1
# 5 6 JJ 20160606 2j 2