python - Pandas merge two dataframes with different columns

Question

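I must be missing something simple here. I'm trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left one lacks, and vice versa.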

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

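I also tried specifying a single column to join on (on = "id", for example), but this duplicates every column except id, producing pairs like attr_1_x and attr_1_y, which is not ideal. I also passed the entire list of columns (there are many) to on: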

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

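What am I missing? I'd like to get a df with all rows appended, with attr_1, attr_2, and attr_3 populated where possible and NaN where they don't appear. This seems like a pretty typical data-munging workflow, but I'm stuck.

Thanks in advance.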

score 147 · Accepted Answer

I think concat is what you want in this case (df and df1 below stand for the two frames from the question):

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

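Passing axis=0 here stacks the dfs on top of each other, which I believe is what you want. Columns are aligned by name, and NaN is filled in wherever a column is absent from one of the dfs.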
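For anyone who wants to try this directly, here is a minimal sketch that rebuilds the two frames from the question and stacks them. The frame contents are copied from the question; the name combined is only an illustration. Column order in the result may differ from the output above depending on the pandas version.

```python
import pandas as pd

# Rebuild the two frames shown in the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns are aligned by name and gaps become NaN.
combined = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(combined)
```

Because attr_2 and attr_3 each gain NaN entries, those two columns are promoted from integer to float dtype.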

score 3 · Answer

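The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat: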

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before calling concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns) 

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner, though it is less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
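Since _maybe_dedup_names is a private pandas method, it may change or disappear between versions. A minimal sketch of the same idea using only public API follows. The helper dedup_columns is hypothetical (not part of pandas); it mimics the name, name.1, name.2 pattern shown above and does not guard against a column that is already literally named trial.1.

```python
import pandas as pd

def dedup_columns(columns):
    """Hypothetical helper: rename repeated labels as name.1, name.2, ..."""
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

# Rename on copies so the original frames are left untouched.
frames = []
for df in (A, B):
    renamed = df.copy()
    renamed.columns = dedup_columns(df.columns)
    frames.append(renamed)

result = pd.concat(frames, ignore_index=True)
print(result)
```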

score 1 · Answer

I ran into this problem today with each of concat, append, and merge, and got around it by adding a sequentially numbered helper column and then doing an outer join:

helper=1
for i in df1.index:
    df1.loc[i,'helper']=helper
    helper=helper+1
for i in df2.index:
    df2.loc[i,'helper']=helper
    helper=helper+1
df1.merge(df2,on='helper',how='outer')
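The same numbering can be done without the row-by-row loops. This is only a sketch: the small df1 and df2 frames and the names left, right, and merged are made up for illustration.

```python
import pandas as pd

# Small illustrative frames (assumed data, not from the answer above).
df1 = pd.DataFrame({"id": [1, 2], "attr_2": [1, 0]})
df2 = pd.DataFrame({"id": [5, 6], "attr_3": [0, 1]})

# Number the rows 1..len(df1) in the first frame and continue in the second,
# so no helper value is shared between the two frames.
left = df1.assign(helper=range(1, len(df1) + 1))
right = df2.assign(helper=range(len(df1) + 1, len(df1) + len(df2) + 1))

# With no matching helper values, the outer join keeps every row from both.
merged = left.merge(right, on="helper", how="outer")
print(merged)
```

Note that columns present in both frames (here id) come back as id_x and id_y, so they still have to be combined afterwards; concat avoids that extra step.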
