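I am working on the Titanic dataset as my first project. To impute the missing values of the variable "Age", I ran a linear regression model. I now have two DataFrames, shown below: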

train_data.tail()

          Survived  Pclass     Sex   Age  SibSp  Parch   Fare Embarked
    886         0       2    male  27.0      0      0  13.00        S
    887         1       1  female  19.0      0      0  30.00        S
    888         0       3  female   NaN      1      2  23.45        S
    889         1       1    male  26.0      0      0  30.00        C
    890         0       3    male  32.0      0      0   7.75        Q

imp_age.head()

          Age
    859  27.0
    863  -8.0
    868  27.0
    878  27.0
    888  23.0

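The second DataFrame above holds the imputed ages that I want to use in place of the NaN values in the first DataFrame. In both DataFrames this data sits in a column named "Age".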

I tried running the following code to get a merged DataFrame:

merged_df = train_data.merge(imp_age,how='outer',left_index=True,right_index=True)

But the output creates an extra "Age_y" column instead of merging the values into the existing column:

     Survived  Pclass     Sex  Age_x  SibSp  Parch   Fare Embarked  Age_y
886         0       2    male   27.0      0      0  13.00        S    NaN
887         1       1  female   19.0      0      0  30.00        S    NaN
888         0       3  female    NaN      1      2  23.45        S   23.0
889         1       1    male   26.0      0      0  30.00        C    NaN
890         0       3    male   32.0      0      0   7.75        Q    NaN

Can someone help me get the desired output below? I have tinkered with this a lot, but since I am new to Python I am struggling a bit:

     Survived  Pclass     Sex   Age  SibSp  Parch   Fare Embarked
886         0       2    male  27.0      0      0  13.00        S
887         1       1  female  19.0      0      0  30.00        S
888         0       3  female  23.0      1      2  23.45        S
889         1       1    male  26.0      0      0  30.00        C
890         0       3    male  32.0      0      0   7.75        Q

1 Answer

Try fillna:

train_data['Age'] = train_data['Age'].fillna(imp_age['Age'])
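This works because `fillna` aligns a Series argument on the index, so only the NaN rows whose index label also appears in `imp_age` get a value. Here is a minimal sketch of that behaviour. The two frames are cut-down stand-ins built from a few rows shown in the question (only two columns of `train_data`, two rows of `imp_age`), not the real data:

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for train_data: rows 886-890, two columns only.
train_data = pd.DataFrame(
    {
        "Survived": [0, 1, 0, 1, 0],
        "Age": [27.0, 19.0, np.nan, 26.0, 32.0],
    },
    index=[886, 887, 888, 889, 890],
)

# Cut-down stand-in for imp_age: one label that is outside this slice (878)
# and one that matches the NaN row (888).
imp_age = pd.DataFrame({"Age": [27.0, 23.0]}, index=[878, 888])

# fillna matches on index labels: row 888 takes 23.0 from imp_age,
# rows that already have an age are left alone, and label 878 is ignored
# because it does not exist in this slice of train_data.
train_data["Age"] = train_data["Age"].fillna(imp_age["Age"])

print(train_data)
```

No merge is needed, so no "Age_x"/"Age_y" columns appear. Any NaN row without a matching label in `imp_age` would simply stay NaN.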
Answered 2020-05-24T11:16:14.813