1

我有两个要在“Id”上合并的数据框。两个数据帧的大小不一样,其中一个可能会重复Id。我希望在 df1 中为每个 Id 的第一个实例合并列,然后为不匹配的实例附加数据表。

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 4, 4, 4],
                     "x": [3, 7, 8, 3, 5, 4, 1]})

df2 = pd.DataFrame({"Id": [1, 2, 2, 2, 3, 4, 4],
                     "y": [9, 7, 5, 8, 9, 6, 8]})

预期输出:

df1 = {"Id": [1, 1, 2, 2, 2, 3, 4, 4, 4],
"x": [3, 7, 8, NaN, NaN, 3, 5, 4, 1],
"y": [9, NaN, 7, 5, 8, 9, 6, 8, NaN]} 
}

编辑:将解决方案扩展到我更大的数据集会导致错误:'float' object not iterable。

所有 Dates 列都是 datetime,而 UserId 和 ActivityCount 是 int64。

df = (
    pd.concat(
        [df_emails.groupby("UserId")["Email Dates"].agg(list), df_emails.groupby("UserId")["Click Dates"].agg(list), 
         df_emails.groupby("UserId")["Open Dates"].agg(list), df_activity.groupby("UserId")["Activity Dates"].agg(list), 
         df_activity.groupby("UserId")["ActivityCount"].agg(list)],
        axis=1,
    )
    .reset_index()
    .apply(
        lambda x: [
            (x["UserId"], *t) for t in zip_longest(x["Email Dates"], x["Click Dates"], x["Open Dates"], 
                                                   x["Activity Dates"], x["ActivityCount"], fillvalue=np.nan)
        ],
        axis=1,
    )
    .explode()
    .apply(pd.Series)
    .rename(columns={0: "UserId", 1: "Email Dates", 2: "Click Dates", 3: "Open Dates", 4: "Activity Dates", 5: "ActivityCount"})
)
4

1 回答 1

0

尝试:

from itertools import zip_longest


df = (
    pd.concat(
        [df1.groupby("Id")["x"].agg(list), df2.groupby("Id")["y"].agg(list)],
        axis=1,
    )
    .fillna("")
    .apply(list)
    .reset_index()
    .apply(
        lambda x: [
            (x["Id"], *t) for t in zip_longest(x["x"], x["y"], fillvalue=np.nan)
        ],
        axis=1,
    )
    .explode()
    .apply(pd.Series)
    .rename(columns={0: "Id", 1: "x", 2: "y"})
)
print(df)

印刷:

    Id    x    y
0  1.0  3.0  9.0
0  1.0  7.0  NaN
1  2.0  8.0  7.0
1  2.0  NaN  5.0
1  2.0  NaN  8.0
2  3.0  3.0  9.0
3  4.0  5.0  6.0
3  4.0  4.0  8.0
3  4.0  1.0  NaN
于 2021-04-30T00:32:03.567 回答