python - 合并两个数据框（都具有多索引）

Question

我有一个类似于Merge two dataframes with multi-index的问题。</p>

在：

import pandas as pd
import numpy as np
row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
index_arrays = [np.array(['first', 'first', 'second', 'second']), np.array(['one','two','one','two'])]
df1 = pd.DataFrame([row_x1,row_x2,row_x3,row_x4], columns=list('ABC'), index=index_arrays)
print(df1)

出去：

             A   B   C
first  one  a1  b1  c1
       two  a2  b2  c2
second one  a3  b3  c3
       two  a4  b4  c4

在：

row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']
index_arrays = [np.array(['first','first', 'second',]), np.array(['one','three','two'])]
df2 = pd.DataFrame([row_y1,row_y2,row_y3], columns=list('DEF'), index=index_arrays)
print(df2)

出去：

               D   E   F
first  one    d1  e1  f1
       three  d2  e2  f2
second two    d3  e3  f3

换句话说，我如何合并它们以实现 df3（如下）？

在：

row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']

row_z1 = row_x1 + row_y1
row_z2 = row_x2 + [np.nan, np.nan, np.nan]
row_z3 = [np.nan, np.nan, np.nan] + row_y2
row_z4 = row_x3 + [np.nan, np.nan, np.nan]
row_z5 = row_x4 + row_y3
index_arrays = [np.array(['first', 'first', 'first', 'second', 'second']), np.array(['one','two','three','one','two'])]
df3 = pd.DataFrame([row_z1,row_z2,row_z3,row_z4,row_z5], columns=list('ABCDEF'), index=index_arrays)
print(df3)

出去：

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       two     a2   b2   c2  NaN  NaN  NaN
       three  NaN  NaN  NaN   d2   e2   f2
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3

PS。感谢@Andreuccio 的问题！

感谢@Ajay Verma 和@EBDS。这确实是手动创建 df 数据的解决方案。但我对以下情况感到非常困惑：

我有两个来自统计数据的数据框。然后我为 pd.merge() 复制了相应的数据

在：

df1 = data1[data1.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

出去：

                         0       1       2       3
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0
               2      38.0    23.0    21.0    29.0
               3      37.0    27.0    32.0    37.0
               4       5.0    14.0    11.0    23.0
               5      31.0    56.0    41.0    65.0
               7     389.0   258.0   337.0   243.0
               NaN  1323.0  1388.0  1307.0  1311.0

在：

df2 = data2[data2.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

出去：

                         0       1       2       3
BASIC_GZAG_TMB 1     207.0   232.0   252.0   223.0
               2      26.0    18.0    19.0    20.0
               3      43.0    41.0    50.0    42.0
               4      35.0    27.0    37.0    15.0
               5      54.0    52.0    78.0    64.0
               6       1.0  1306.0     1.0     4.0
               7     206.0   263.0   227.0   230.0
               NaN  1374.0  1306.0  1282.0  1348.0

然后我通过以下方式合并 df1 和 df2：

df1.merge(df2, left_index=True, right_index=True, how='outer')

出去：

                       0_x     1_x     2_x     3_x     0_y     1_y     2_y  \
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0   207.0   232.0   252.0   
               2      38.0    23.0    21.0    29.0    26.0    18.0    19.0   
               3      37.0    27.0    32.0    37.0    43.0    41.0    50.0   
               4       5.0    14.0    11.0    23.0    35.0    27.0    37.0   
               5      31.0    56.0    41.0    65.0    54.0    52.0    78.0   
               7     389.0   258.0   337.0   243.0   206.0   263.0   227.0   
               NaN  1323.0  1388.0  1307.0  1311.0  1374.0  1306.0  1282.0   

                       3_y  
BASIC_GZAG_TMB 1     223.0  
               2      20.0  
               3      42.0  
               4      15.0  
               5      64.0  
               7     230.0  
               NaN  1348.0

我对 df2 中存在的 6 的索引在结果中消失了感到困惑。

我知道如果我使用 df2.merge(df1...) 可以成为一个解决方案。但其实data1和data2是动态生成的，不知道哪个有更多的索引。我只想得到 df1 和 df2 的联合。

score 0 · Accepted Answer

你可以使用 Pandas merge。文档链接：link

df = df1.merge(df2, left_index=True, right_index=True, how='outer')
print(df)

输出

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       three  NaN  NaN  NaN   d2   e2   f2
       two     a2   b2   c2  NaN  NaN  NaN
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3

score 0 · Accepted Answer

如果需要按数字词排序……一、二、三……

使用 pandas 合并命令
使用 key= vectorize parse 对索引进行排序

代码：

from number_parser import parse
dfx = (
    df1.merge(df2,left_index=True,right_index=True,how='outer')
    .sort_index(key=lambda x: np.vectorize(parse)(x).astype(float)) )

另一个例子：

您可能需要安装 number_parse：

!pip install number_parser

更新：

由于我没有新数据，所以我使用原始数据来测试“丢失的 6”。我还将列名更改为相同，并添加了一个 nan 索引。

data1 = df1.copy(deep=True)
data2 = df2.copy(deep=True)
df1 = data1[data1.index.get_level_values(0) == 'first'].copy()
df2 = data2[data2.index.get_level_values(0) == 'first'].copy()

dfx = df1.merge(df2, left_index=True, right_index=True, how='outer').sort_index(
        key=lambda x: np.vectorize(parse)(x)
        )

如您所见，它没有丢失任何值。问题可能不在于合并部分，需要检查导致这种情况的源数据。

python - 合并两个数据框（都具有多索引）

2 回答 2

Related

Reference