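I'm working on the Titanic dataset as my first project. To impute the missing values of the variable "Age", I ran a linear regression model. I now have 2 DataFrames, shown below: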
train_data.tail()
     Survived  Pclass     Sex   Age  SibSp  Parch   Fare  Embarked
886         0       2    male  27.0      0      0  13.00         S
887         1       1  female  19.0      0      0  30.00         S
888         0       3  female   NaN      1      2  23.45         S
889         1       1    male  26.0      0      0  30.00         C
890         0       3    male  32.0      0      0   7.75         Q
imp_age.head()
      Age
859  27.0
863  -8.0
868  27.0
878  27.0
888  23.0
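The second DataFrame above holds the age values I want to impute in place of the NaN values in the first DataFrame. Both DataFrames store this data under the column name "Age", and the index of imp_age matches the rows of train_data that need filling.
I tried running the following code to get a merged df: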
merged_df = train_data.merge(imp_age,how='outer',left_index=True,right_index=True)
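But the output creates an extra "Age_y" column (and renames the original to "Age_x") instead of merging the values into the old column: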
     Survived  Pclass     Sex  Age_x  SibSp  Parch   Fare  Embarked  Age_y
886         0       2    male   27.0      0      0  13.00         S    NaN
887         1       1  female   19.0      0      0  30.00         S    NaN
888         0       3  female    NaN      1      2  23.45         S   23.0
889         1       1    male   26.0      0      0  30.00         C    NaN
890         0       3    male   32.0      0      0   7.75         Q    NaN
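For reference, here is a minimal sketch of how the merged frame could be collapsed back into a single "Age" column. It assumes the default merge suffixes (`_x`, `_y`) seen in the output above, and it uses a cut-down stand-in for train_data (only two columns and the last five rows) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the question's data (assumption: only the tail rows).
train_data = pd.DataFrame(
    {"Survived": [0, 1, 0, 1, 0], "Age": [27.0, 19.0, np.nan, 26.0, 32.0]},
    index=[886, 887, 888, 889, 890],
)
imp_age = pd.DataFrame({"Age": [23.0]}, index=[888])

# Same merge as in the question: the shared column name "Age" is
# disambiguated with the default suffixes, giving Age_x and Age_y.
merged_df = train_data.merge(imp_age, how="outer", left_index=True, right_index=True)

# Take Age_x where it is present, otherwise fall back to Age_y,
# then drop the two suffixed helper columns.
merged_df["Age"] = merged_df["Age_x"].fillna(merged_df["Age_y"])
merged_df = merged_df.drop(columns=["Age_x", "Age_y"])

print(merged_df)
```

Note that the rebuilt "Age" column ends up as the last column, so the original column order is not preserved with this approach.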
Can someone help me get the desired output below? I've fiddled with this a lot, but since I'm new to Python I'm struggling a bit:
     Survived  Pclass     Sex   Age  SibSp  Parch   Fare  Embarked
886         0       2    male  27.0      0      0  13.00         S
887         1       1  female  19.0      0      0  30.00         S
888         0       3  female  23.0      1      2  23.45         S
889         1       1    male  26.0      0      0  30.00         C
890         0       3    male  32.0      0      0   7.75         Q
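One possible way to reach this output without merging at all is sketched below. It relies on `Series.fillna` aligning on the index when it is given another Series, so only the NaN entries of "Age" are replaced. The frames here are small stand-ins built from the rows shown above (an assumption for illustration, not the full dataset), and `filled` is just a name chosen for the result:

```python
import numpy as np
import pandas as pd

# Stand-in for train_data: the five tail rows from the question.
train_data = pd.DataFrame(
    {
        "Survived": [0, 1, 0, 1, 0],
        "Pclass": [2, 1, 3, 1, 3],
        "Sex": ["male", "female", "female", "male", "male"],
        "Age": [27.0, 19.0, np.nan, 26.0, 32.0],
        "SibSp": [0, 0, 1, 0, 0],
        "Parch": [0, 0, 2, 0, 0],
        "Fare": [13.00, 30.00, 23.45, 30.00, 7.75],
        "Embarked": ["S", "S", "S", "C", "Q"],
    },
    index=[886, 887, 888, 889, 890],
)

# Stand-in for imp_age: only the imputed value that falls in these rows.
imp_age = pd.DataFrame({"Age": [23.0]}, index=[888])

# Work on a copy so the original frame is left untouched.
filled = train_data.copy()

# fillna with a Series matches on index labels: existing ages are kept,
# and only the missing entry at index 888 receives the imputed value.
filled["Age"] = filled["Age"].fillna(imp_age["Age"])

print(filled)
```

Because the column is filled in place, the column order stays the same as in train_data and no "Age_x"/"Age_y" columns are created.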