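I have a simple DataFrame with nationality, occupation, and age columns. Nationality is integer-coded as 0, 1, 2 for European, American, and Asian respectively.

For each occupation, I want to find the percentage of each nationality, for example: 67% of doctors are European and 33% are Asian.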

import pandas as pd
import numpy as np

# Create the DataFrame: nationality codes in {0, 1, 2}, ages in [24, 70)
df = pd.DataFrame({
    'nationality': np.random.randint(low=0, high=3, size=10),
    'age': np.random.randint(low=24, high=70, size=10),
})
df['occupation'] = ['teacher']*2 + ['engineer']*3 + ['doctor']*3 + ['lawyer']*2


   nationality  age occupation
0            0   65    teacher
1            0   31    teacher
2            0   30   engineer
3            2   63   engineer
4            0   28   engineer
5            1   27     doctor
6            0   52     doctor
7            0   60     doctor
8            0   33     lawyer
9            0   38     lawyer

df.groupby(['occupation','nationality']).count()

def iseuropean(x):
    if x==0:
        return 1
    else:
        return 0
def isamerican(x):
    if x==1:
        return 1
    else:
        return 0
def isasian(x):
    if x==2:
        return 1
    else:
        return 0

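With groupby I can get the counts, but I want to apply a function to each occupation group to work out the percentages, and I haven't been able to figure out how.

Any help would be greatly appreciated.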

1 Answer

I assume you're looking for the percentage of each nationality within each occupation:

In [11]: c = df.groupby(['occupation','nationality'])["age"].count().rename("count")

In [12]: c
Out[12]:
occupation  nationality
doctor      0              2
            1              1
engineer    0              2
            2              1
lawyer      0              2
teacher     0              2
Name: count, dtype: int64

In [13]: c / c.sum()  # proportion of each, maybe not very useful
Out[13]:
occupation  nationality
doctor      0              0.2
            1              0.1
engineer    0              0.2
            2              0.1
lawyer      0              0.2
teacher     0              0.2
Name: count, dtype: float64

In [14]: c / c.groupby(level=0).sum()  # proportion within each occupation
Out[14]:
occupation  nationality
doctor      0              0.666667
            1              0.333333
engineer    0              0.666667
            2              0.333333
lawyer      0              1.000000
teacher     0              1.000000
Name: count, dtype: float64
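
The same per-occupation proportions can probably also be reached in one step with `value_counts(normalize=True)` on the grouped column. This is only a minimal sketch: the data is hard-coded from the DataFrame printed in the question (the original is random, so this is an assumption made for reproducibility), and the name `pct` is purely illustrative.

```python
import pandas as pd

# Data copied from the question's printed DataFrame (assumed; the original is random)
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Share of each nationality within each occupation, scaled to a percentage
pct = df.groupby("occupation")["nationality"].value_counts(normalize=True) * 100
print(pct.round(1))
```

The result is a Series indexed by (occupation, nationality), so `pct.loc[("doctor", 0)]` gives the percentage of doctors with nationality code 0.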

As an aside, you may want to use a Categorical rather than the is_XXX functions:

In [21]: pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])
Out[21]:
[european, european, european, asian, european, american, european, european, european, european]
Categories (3, object): [european, american, asian]

In [22]: df.nationality = pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])

In [23]: df
Out[23]:
  nationality  age occupation
0    european   65    teacher
1    european   31    teacher
2    european   30   engineer
3       asian   63   engineer
4    european   28   engineer
5    american   27     doctor
6    european   52     doctor
7    european   60     doctor
8    european   33     lawyer
9    european   38     lawyer
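
Once the column is categorical, the percentages can be laid out as a table with readable labels. A hedged sketch using `pd.crosstab` with `normalize="index"`, again assuming the data printed in the question; `labels` and `table` are names introduced here for illustration only.

```python
import pandas as pd

# Data copied from the question's printed DataFrame (assumed; the original is random)
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Map the integer codes 0, 1, 2 to labels, as in the answer above
labels = ["european", "american", "asian"]
df["nationality"] = pd.Categorical.from_codes(df["nationality"], categories=labels)

# Rows are occupations, columns are nationalities; normalize="index" makes
# each row sum to 1, and multiplying by 100 turns that into percentages
table = pd.crosstab(df["occupation"], df["nationality"], normalize="index") * 100
print(table.round(1))
```

Normalizing by row rather than over the whole table is what gives "percentage of each nationality per occupation"; `normalize="all"` would instead reproduce the less useful `c / c.sum()` proportions.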
Answered 2017-11-12T18:16:06.303