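I have a simple dataframe with nationality, occupation, and age columns. Nationality is encoded as an integer: 0 for EU, 1 for America, 2 for Asia.
For each occupation, I want to find the percentage of each nationality, e.g. 67% of doctors are European and 33% are Asian.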
import pandas as pd
import numpy as np
# create dataframe
nationality = np.random.randint(low=0, high=3, size=(10, 1))
age = np.random.randint(low=24, high=70, size=(10, 1))
df = pd.DataFrame(np.concatenate((nationality, age), axis=1))
df.columns = ['nationality', 'age']
df['occupation'] = ['teacher']*2 + ['engineer']*3 + ['doctor']*3 + ['lawyer']*2
   nationality  age occupation
0            0   65    teacher
1            0   31    teacher
2            0   30   engineer
3            2   63   engineer
4            0   28   engineer
5            1   27     doctor
6            0   52     doctor
7            0   60     doctor
8            0   33     lawyer
9            0   38     lawyer
df.groupby(['occupation','nationality']).count()
def iseuropean(x):
    if x==0:
        return 1
    else:
        return 0
def isamerican(x):
    if x==1:
        return 1
    else:
        return 0
def isasian(x):
    if x==2:
        return 1
    else:
        return 0
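Using groupby I can get the counts, but I would like to apply a function to each occupation group to determine the percentages. However, I haven't been able to figure out how.
Any help would be much appreciated.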
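
To make the desired result concrete, here is a minimal sketch of one possible way to compute it, using `pd.crosstab` with `normalize="index"` rather than the helper functions above. The data is hard-coded from the sample table so the result is reproducible, and the `labels` mapping and the variable names `shares` and `percentages` are assumptions made up for this illustration.

```python
import pandas as pd

# Sample data copied from the table printed above (fixed instead of random).
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Hypothetical label mapping, based on the encoding described in the question.
labels = {0: "Europe", 1: "America", 2: "Asia"}

# Cross-tabulate occupation against nationality; normalize="index" divides
# each row by its total, so every occupation's shares sum to 1.
shares = pd.crosstab(df["occupation"], df["nationality"], normalize="index")

# Convert the fractions to percentages and give the columns readable names.
percentages = (shares * 100).round(1).rename(columns=labels)
print(percentages)
```

With this sample, the doctor row comes out as roughly 66.7% Europe and 33.3% America. Something like `df.groupby("occupation")["nationality"].value_counts(normalize=True)` should give the same fractions in long format, if staying with groupby is preferred.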