python - How to find duplicate values and put the corresponding values in new columns

Question

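I have a dataframe like this. For each ID, the unique countries and names must go into new columns.

Expected output: duplicated values do not need to appear in the new columns, so those cells can be left empty.

I tried the code below. It works for one column, but I need to do the same for two columns.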

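import pandas as pd

# Collect the unique COUNTRY values per ID, then spread each array across columns
group = df.groupby('ID')
df1 = group.apply(lambda x: x['COUNTRY'].unique())
df1 = df1.apply(pd.Series)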

score 2 · Accepted Answer

You can do the following:

import pandas as pd

# Aggregate every column per ID into a list of its unique values
new_df = df.groupby('ID').agg(lambda x: x.unique().tolist())

# Generate the column names to use after expanding the lists
country_cols = ['Country_' + str(i) for i in range(new_df['COUNTRY'].str.len().max())]
name_cols = ['Name_' + str(i) for i in range(new_df['NAME'].str.len().max())]

# Drop the list-valued COUNTRY and NAME columns, expand each of them into one
# column per unique value, concatenate the pieces, and fill the gaps
df2 = pd.concat(
    [new_df.drop(['COUNTRY', 'NAME'], axis=1),
     pd.DataFrame.from_records(new_df['COUNTRY'].tolist(), columns=country_cols, index=new_df.index),
     pd.DataFrame.from_records(new_df['NAME'].tolist(), columns=name_cols, index=new_df.index)],
    axis=1
).fillna(' ')
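To make the list-expansion idea above easier to try, here is a minimal, self-contained sketch. The sample frame is hypothetical (the original data is not shown in the question), and the helper name `expand_unique` is an illustration, not part of pandas. It assumes the columns are called `ID`, `COUNTRY` and `NAME`, as in the question's code.

```python
import pandas as pd

# Hypothetical sample data; the real frame from the question is not available.
df = pd.DataFrame({
    "ID": ["20_001", "20_001", "20_001", "20_002", "20_002",
           "20_003", "20_003", "20_004"],
    "COUNTRY": ["US", "IN", "US", "US", "US", "US", "EU", "EU"],
    "NAME": ["LIZ", "LAK", "LIZ", "LIZ", "CHRI", "LIZ", "LIZ", "CHRI"],
})


def expand_unique(frame, key, col):
    """Spread the unique values of `col` within each `key` group over numbered columns."""
    # One list of unique values per group, in order of first appearance.
    lists = frame.groupby(key)[col].agg(lambda s: s.unique().tolist())
    # Building a DataFrame from ragged lists pads the shorter rows with missing values.
    wide = pd.DataFrame(lists.tolist(), index=lists.index)
    wide.columns = [f"{col}_{i}" for i in range(wide.shape[1])]
    return wide


result = pd.concat(
    [expand_unique(df, "ID", c) for c in ["COUNTRY", "NAME"]], axis=1
).fillna("")  # empty string for IDs with fewer unique values
print(result)
```

Filling with an empty string rather than a space is a choice made here to match the "can be left empty" requirement. Use whichever placeholder suits the downstream consumer.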

score 1 · Accepted Answer

We can do this with a simple function:

import pandas as pd

def unique_column_unstack(dataframe, agg_columns):
    dfs = []
    for col in agg_columns:
        # One row per ID, one column per unique value of `col`
        agg_df = dataframe.groupby('ID')[col].apply(lambda x: pd.Series(x.unique().tolist())).unstack()
        # Rename the positional columns 0, 1, ... to col_1, col_2, ...
        agg_df.columns = agg_df.columns.map(lambda x: f"{col}_{x+1}")
        dfs.append(agg_df)
    return pd.concat(dfs, axis=1)

new_df = unique_column_unstack(df, ['COUNTRY', 'NAME'])

print(new_df)

       COUNTRY_1 COUNTRY_2 NAME_1 NAME_2
ID                                      
20_001        US        IN    LIZ    LAK
20_002        US       NaN    LIZ   CHRI
20_003        US        EU    LIZ    NaN
20_004        EU       NaN   CHRI    NaN
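For comparison, the same wide layout can probably also be reached without `apply`, by dropping duplicate (ID, value) pairs, numbering the survivors within each ID with `cumcount`, and pivoting. This is a different technique from the function above, sketched under the assumption that the input looks like the hypothetical frame below, which was chosen to be consistent with the printed output. The helper name `unique_wide` is made up for this example.

```python
import pandas as pd

# Hypothetical input, chosen to be consistent with the printed output above.
df = pd.DataFrame({
    "ID": ["20_001", "20_001", "20_001", "20_002", "20_002",
           "20_003", "20_003", "20_004"],
    "COUNTRY": ["US", "IN", "US", "US", "US", "US", "EU", "EU"],
    "NAME": ["LIZ", "LAK", "LIZ", "LIZ", "CHRI", "LIZ", "LIZ", "CHRI"],
})


def unique_wide(frame, key, col):
    """Pivot the unique values of `col` per `key` into columns col_1, col_2, ..."""
    # Keep the first occurrence of each (key, value) pair, preserving row order.
    pairs = frame[[key, col]].drop_duplicates()
    # Number the remaining values within each key: 1, 2, ...
    pairs = pairs.assign(pos=pairs.groupby(key).cumcount() + 1)
    wide = pairs.pivot(index=key, columns="pos", values=col)
    wide.columns = [f"{col}_{n}" for n in wide.columns]
    return wide


new_df = pd.concat(
    [unique_wide(df, "ID", c) for c in ["COUNTRY", "NAME"]], axis=1
)
print(new_df)
```

Missing positions stay as NaN here. Chain `.fillna('')` onto the result if blank cells are preferred.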
