python - Pandas groupby and making a judgement on a dataset

Question

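I have a DataFrame in which certain rows are classified as "Pass" or "Fail". I am trying to make an overall judgement on each item based on how many times it passed or failed.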

pandas version 0.23.4

Given the following df:

*Note: several other columns exist, but only these two matter for this purpose.

Name    Judgement
A        Pass
A        Fail
A        Fail
A        Pass
X        Pass
X        Pass
Z        Pass
Z        Pass
Z        Fail
F        Pass

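To make an overall judgement, we look at how many times each item passed or failed. An item that occurs more than twice can be judged an "overall pass" only if (# of pass == # of fail). An item that occurs once needs no further judgement.

Expected output: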

Name    Judgement
A        Pass
X        Pass
Z        Fail
F        Pass

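Note that A passes because it has 2 passes and 2 fails, so 2/2 = 1 == Pass

Z fails because it has 2 passes and 1 fail, so 2/1 = 2 == Fail

My idea:

Group by df['Name'] together with Judgement and simply count how many times each judgement type occurs for each name. Is there a cleaner way to do this? The idea seems a bit cumbersome, but it is the best I could come up with.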
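The counting idea described above can be sketched as follows. This is only an illustration, assuming the example data from the question; pd.crosstab is one way to get the per-name counts, not necessarily what the asker had in mind.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# One row per Name, one column per judgement type; cells hold the counts.
counts = pd.crosstab(df["Name"], df["Judgement"])
print(counts)
```

Missing combinations (for example F never failing) show up as 0 in the table.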

score 2 · Accepted Answer

Here is my approach:

# 'Pass' becomes True, so size = row count, mean = pass fraction, max = any pass
new_df = df.Judgement.eq('Pass').groupby(df['Name']).agg(['size', 'mean', 'max'])

is_passed = ( # check those with more than two counts
             (new_df['mean'].eq(0.5) & new_df['size'].gt(2))
              # those with one or two counts pass if they have a pass
             | (new_df['size'].le(2) & new_df['max'])
            )

which gives:

Name
A     True
F     True
X     True
Z    False
dtype: bool

Equivalently, we can do:

is_passed = np.where(new_df['size'].le(2), new_df['max'] , new_df['mean'].eq(0.5))

You can use np.where to turn the mask into pass/fail labels:

np.where(is_passed, 'pass', 'fail')
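One detail worth noting: np.where returns a plain NumPy array, so the Name index is lost. A minimal sketch of keeping the labels attached, assuming the example data from the question (the names stats and verdict are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

stats = df["Judgement"].eq("Pass").groupby(df["Name"]).agg(["size", "mean", "max"])

# Up to two rows: pass if any row passed; otherwise require an even split.
is_passed = np.where(stats["size"].le(2), stats["max"], stats["mean"].eq(0.5))

# Wrap the array in a Series so each verdict stays tied to its Name.
verdict = pd.Series(np.where(is_passed, "Pass", "Fail"),
                    index=stats.index, name="Judgement")
print(verdict)
```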

score 2 · Accepted Answer

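Is this what you need? A mean of 0.5 means the pass and fail counts are equal, and a mean of 1 means every row passed; either of these two conditions produces a pass.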

s=df.Judgement.eq('Pass').groupby(df['Name']).agg(['mean','count'])
((s['mean'].eq(1)&s['count'].le(2))|s['mean'].eq(0.5)).map({True:'Pass',False:'Fail'})
Out[436]: 
Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object
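To see why the mean works here, it may help to look at the intermediate table. The sketch below assumes the example data from the question; since True counts as 1 and False as 0, the mean of the boolean column is the fraction of passes in each group.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# mean = fraction of passes, count = number of rows per Name.
s = df["Judgement"].eq("Pass").groupby(df["Name"]).agg(["mean", "count"])
print(s)

# All-pass groups of at most two rows, or an exact even split, count as a pass.
verdict = ((s["mean"].eq(1) & s["count"].le(2)) | s["mean"].eq(0.5)).map(
    {True: "Pass", False: "Fail"}
)
print(verdict)
```

An even split gives exactly 0.5 (for example 2 passes out of 4 rows), so the equality test is safe for this case.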

score 1 · Accepted Answer

With a custom apply function:

In [334]: def compare_pass_fail(x):
     ...:     v_counts = x['Judgement'].value_counts()
     ...:     return 'Pass' if ('Fail' not in v_counts or v_counts.get('Pass') == v_counts['Fail']) else 'Fail'
     ...: 
In [335]: df.groupby('Name').apply(compare_pass_fail)
Out[335]: 
Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object
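A possible variant of the same idea, sketched under the assumption of the example data from the question: selecting the Judgement column before calling apply hands the function a Series rather than a whole sub-DataFrame. Like the function above, it treats a name with no failures as a pass no matter how often it occurs.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

def compare_pass_fail(judgements: pd.Series) -> str:
    # Count each judgement type within one Name group; absent types count as 0.
    counts = judgements.value_counts()
    n_pass = counts.get("Pass", 0)
    n_fail = counts.get("Fail", 0)
    return "Pass" if n_fail == 0 or n_pass == n_fail else "Fail"

result = df.groupby("Name")["Judgement"].apply(compare_pass_fail)
print(result)
```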

score 1 · Accepted Answer

I used the pandas groupby apply function. The logic may differ slightly, but it works for your case.

df = pd.DataFrame({
    "Name": ["A","A","A","A","X","X","Z","Z","Z","F"],
    "Judgement": ["Pass","Fail","Fail","Pass","Pass","Pass","Pass","Pass","Fail","Pass"],
})



  Name  Judgement
0   A   Pass
1   A   Fail
2   A   Fail
3   A   Pass
4   X   Pass
5   X   Pass
6   Z   Pass
7   Z   Pass
8   Z   Fail
9   F   Pass

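def func(x):
    # Count passes and fails within one Name group
    # (named n_pass/n_fail so that numpy's usual alias np is not shadowed)
    n_pass = len(x[x["Judgement"] == "Pass"])
    n_fail = len(x[x["Judgement"] == "Fail"])
    if n_pass * n_fail == 0:
        # Only one kind of judgement is present: return it as-is
        return x["Judgement"].unique()[0]
    # Mixed results pass only when the two counts are equal
    return "Pass" if n_pass == n_fail else "Fail"

df.groupby("Name").apply(func)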

Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object

score 0 · Accepted Answer

You can also generate a DataFrame first with the pass-fail counts and work on that:

df_count= df.groupby(['Name', 'Judgement']).apply(len).unstack(-1).fillna(0)

And then work on its columns:

((df_count['Fail'] == df_count['Pass']) | ((df_count['Fail'] == 0) & (df_count['Pass'].le(2)))).map({True: 'Pass', False: 'Fail'})

The overall result is:

Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object

df_count can be used to check the result and looks like this:

Judgement  Fail  Pass
Name                 
A           2.0   2.0
F           0.0   1.0
X           0.0   2.0
Z           1.0   2.0
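As a possible alternative (a sketch assuming the example data from the question, not the answer's own code), size() followed by unstack(fill_value=0) builds the same table with integer counts instead of floats:

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# Count rows per (Name, Judgement) pair, then pivot Judgement into columns.
# fill_value=0 fills missing pairs directly, so the counts stay integers.
df_count = df.groupby(["Name", "Judgement"]).size().unstack(fill_value=0)

# Same rule as above: equal counts, or no failures with at most two passes.
overall = (
    (df_count["Fail"] == df_count["Pass"])
    | ((df_count["Fail"] == 0) & df_count["Pass"].le(2))
).map({True: "Pass", False: "Fail"})
print(df_count)
print(overall)
```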
